Live Session

This is the live session work space for the course. Our goal with this repository, is that we’re able to communicate ahead of time our aims for each week, and that you can prepare accordingly.

Probability Spaces

source('./src/blank_lines.R')

Probability is a system of reasoning about the world in the face of incomplete information. In this course, we’re going to develop an understanding of the implications of core parts of this theory, how this theory was developed, and how these implications relate to every other part of the practice of data science.

probability, the final frontier

Learning Objectives

At the end of this week’s learning, student will be able to:

  1. Find and access all of the course materials;
  2. Develop a course of study that is builds toward success;
  3. Apply the axioms of probability to make a valid statement;
  4. Solve word problems through the application of probability and math rules.

Course Learning Objectives

At this point in the course, there is so much that is before us! As we settle in to study for the semester, it is useful to have a point of view of where we’re trying to go, and what we are going to see along the way.

Allow a justification by analogy:

Suppose that you decide that you would like to be a chef – all of the time watching cooking shows has revealed to you that this is your life’s true calling – and so you enroll in a culinary program.

One does not begin such a program by baking croissants and souffle. They begin the program with knife skills, breaking down ingredients and the basic techniques that build up to produce someone who is not a cook, but a chef – someone who can combine ingredients and techniques to produce novel ideas.

At the same time, however, one has not gone to school just to become a cucumber slicer. The knife skills are instrumental to the eventual goal – of being a chef – but not the goal itself.

At the beginning of the program, we’re teaching these core, fundamental skills. How to read and reason with mathematical objects, how to use conditional probability with the goal of producing a model, and eventually, eventually to create novel work as a data scientist.

At the end of this course, students will be able to:

Understand the building blocks of probability theory that prepare learners for the study of statistical models

  1. Understand the mathematical objects of probability theory and be able to apply their properties.
  2. Understand how high-level concepts from calculus and linear algebra are related to common procedures in data science.
  3. Translate between problems that are defined in business or research terms into problems that can be solved with math.

Understand and apply statistical models in common situations

  1. Understand the theory of statistics to prepare students for inferrential statements.
  2. Understand model parameters and high level strategies to estimate them: means, least squares, and maximum likelihood.
  3. Choose an appropriate statistic, and conduct a hypothesis test in the Neyman-Pearson framework.
  4. Interpret the results of a statistical test, including statistical significance and practical significance.
  5. Recognize limitations of the Neyman-Pearson hypothesis testing framework and be a conscientious participant in the scientific process

Analyze a research question using a linear regression framework

  1. Explore and wrangle data with the intention of understanding the information and relationships that are (and are not) present
  2. Identify the goals of your analysis
  3. Build a model that achieves the goals of an analysis

Interpret the results of a model and communicate them in manner appropriate to the audience

  1. Identify their audience and report process and findings in a manner appropriate to that audience.
  2. Construct regression oriented reports that provide insight for stakeholders.
  3. Construct technical documents of process and code for collaboration and reproducability with peer data scientists.
  4. Read, understand, and assess the claims that are made in technical, regression oriented reports

Contribute proficient, basic work, using industry standard tools and coding practices to a modern data science team.

Demonstrate programming proficiency by translating statistical problems into code. 1. Understand and incorporate best practices for coding style and data carpentry 2. Utilize industry standard tooling for collaboration

Introductions

Instructor Introductions

The instructors for the course come to the program, and to statistics from different backgrounds. Instructors hold PhDs in statistics, astrophysics, biology, political science, computer science, and information.

What does a statistician look like? You!

Identity shapes how people approach and understand their world.

We would like to acknowledge that we have limited diversity of identity among the instructors for this course. We each have been fortunate to be able to study, but we want to acknowledge that the education system in the US has systematically benefited the hegemonic groups and marginalized others voices.

Every one of the instructors shares a core identity as an empathetic educator that wants to understand your strengths, areas for growth, and unique point of view that is shaped by who you are. We want to see a field of data scientists who embrace each others voices, and respects people for the identies that they hold.

  • It doesn’t matter if you’ve never taken a stats class before, or if you’re reviewing using this class. There will be challenges for everyone to overcome.
  • It doesn’t matter how old or young you are. We will all be learning frequentist statistics which is timeless.
  • The color of your skin doesn’t matter; nor does whether you identify as a woman or a man or trans or non-binary; neither does your sexual orientation. There are legacies of exclusion and discrimination against people due to these identities. We will not continue to propagate those legacies and instead will work to controvert those discriminations to build a diverse community of learning in line with the University’s Principles of Community.

Student Introductions [Breakout One]

In a breakout room of between three and four students introduce yourself!

Breakout One. A name story is the unique, and individual story that describes how you came to have the name that you do. While there may be many people are called the same thing, each of their name stories is unique.

Please share: What is your name story?

Student Introductions [Breakout Two]

In the same breakout room:

Breakout Two. Like our names, the reasons that we joined this program, our goals and our histories are different.

Please share: What is your data science story? How did you wind up here, in this room today?

Probability Theory

Probability

Probability is a system of reasoning that we use to model the world under incomplete information. This model underlies virtually every other model you’ll ever use as a data scientist.

told you this would be spacey

In this course, probability theory builds out to random variables; when combined with sampling theory we are able to develop p-values (which are also random variables) and an inferential paradigm to communicate what we know and how certain a statement we can make about it.

In introduction to machine learning, literally the first model that you will train is a naive bayes classifier, which is an application of Bayes’ Theorem, trained using an iterative fitting algorithm. Later in machine learning, you’ll be fitting non-linear models, but at every point the input data that you are supplying to your models are generated from samples from random variables. That the world can be represented by random variables (which we will cover in the coming weeks) means that you can transform – squeeze and smush, or stretch and pull – variables to heighten different aspects of the variables to produce the most useful information from your data.

As you move into NLP, you might think of generative text as a conditional probability problem: given some particular set of words as an input, what is the most likely next word or words that someone might type?

Beyond the direct instrumental value that we see working with probability, there are two additional aims that we have in starting the course in the manner.

First, because we are starting with the axioms of probability as they apply to data science statistics, students in this course develop a much fuller understanding of classical statistics than students in most other programs. Unfortunately, it is very common for students and then professionals to see statistics as a series of rules that have to be followed absolutely and without deviation. In this view of statistics, there are distributions to memorize; there are repeated problems to solve that require the rote application of some algebraic rule (i.e. compute the sample average and standard deviation of some vector); and, there are myriad, byzantine statistical tests to memorize and apply. In this view of statistics, if the real-world problem that comes to you as a data scientist doesn’t clearly fit into a box, there’s no way to move forward.

Statistics like this is not fun.

In the way that we are approaching this course, we hope that you’re able to learn why certain distributions (like the normal distribution) arise repeatedly, and why we can use them. We also hope that because you know how sampling theory and random variables combine, that you can be more creative and inventive to solve problems that you haven’t seen before.

The second additional aim that we have for this course is that it can serve as either an introduction or a re-introduction to reading and making arguments using the language of math. For some, this will be a new language; for others, it may have been some years since they have worked with the language; for some, this will feel quite familiar. New algorithms and data science model advancements nearly always developed in the math first, and then applied into algorithms second. In our view, being a literate reader of graduate- and professional-level math is a necessary skill for any data scientist that is going to keep astride of the field as it continues to develop and these first weeks of the course are designed to bring everyone back into reading and reasoning in the language.

Axiomatic Probability

The book makes a point of defining our axioms of probability, calling them them

Kolmogorov Axioms

Let \(\Omega\) be a sample space, \(S\) be an event space, and \(P\) be a probability measure. Then, \((\Omega, S, P)\) is a probability space if it satisfies the following:

  • Non-negativity: \(\forall A \in S, P(A) \geq 0\), where \(P(A)\) is finite and real.
  • Unitarity: \(P(\Omega)=1\).
  • Countable additivity: if \(A_1, A_2, A_3, \dots \in S\) are pairwise disjoint, then

\[ P(A_1 \cup A_2 \cup A_3 \cup \dots) = P(A_1) + P(A_2) + P(A_3) = \sum_{i}P(A_{i}) \]

There is a lot going on in this definition!

First things first, these are the axioms of probability (read aloud in the booming voice of a god).

This means that these are things that we begin from, sort of the foundational principles of the entire system of reasoning that we are going to use. In the style of argument that we’re going to make, these are things that are sort of off-limits to question. Instead, these serve as the grounding assumptions, and we see what happens as we flow forward from these statements.

Second, and importantly, from these axioms there are a very large set of things that we can build. The first set of things that we will build are probability statements about atomic outcomes (Theorem 1.1.4 in the book), and collections of events. But, these statements, are not the only thing that we’re limited to. We can also build Frequentist Statistics, and Bayesian Statistics and Language Models.

In many ways, these axioms are the fundamental particles that hold our system of probabilistic reasoning together. These are to probability what the fermions and and bosons are to physics.

Definition vs. Theorem

What is the difference between a definition and a theorem? On pages 10 and 11 of the textbook, there is a rapid fire collection of pink boxes. We reproduce them here (notice that they may have different index numbers than the book – this live session book autoindexes and we’re not including every theorem and definition in this live session discussion guide).

Conditional Probability For \(A, B \in S\) with \(P(B) > 0\), the conditional probablity of \(A\) given \(B\) is \[P(A|B) = \frac{P(A\cap B)}{P(B)}.\]

Multiplicative Law of Probability For \(A, B \in S\) with \(P(B) > 0\), \[P(A|B)P(B) = P(A \cap B)\]

Baye’s Rule For \(A, B \in S\) with \(P(A) > 0\) and \(P(B) > 0\), \[P(A|B) = \frac{P(B|A)P(A)}{P(B)}.\]

  • What would happen to the statement of the Multiplicative Law of Probability if we did not have the definition of Conditional Probability?
  • How does one get from the definition, to the law?
  • Can one get to Baye’s Rule wihtout using the Multiplicative Law of Probability?

Working with a Sample Space

As a way to begin lets define terms that we will use for the next activities.

Group Discussion Question

  • What is the definition of a sample space?
  • What is the definition of an event?
  • How are sample spaces, and event spaces related?

Working with a Sample Space, Part I

  1. You roll two six-sided dice:
    1. How would you define an appropriate sample space, \(\Omega\)?
    2. How many elements exist in \(\Omega\)?
    3. What is an appropriate event space, and how many elements does it have?
    4. Give an example of an event.

 
 
 
 
 

Working with a Sample Space, Part II

  1. For a random sample of 1,000 Berkeley students:
    1. How would you define an appropriate sample space, \(\Omega\)?
    2. How big is \(\Omega\)? How many elements does it contain?
    3. What is an example of an event for this scenario?
    4. Can a single person be represented in the space twice? Why or why not?

 
 
 
 
 

Independence

The book provides a (characteristically) terse statement of what it means for two events to be independent of one another.

Independence of Events Events \(A, B \in S\) are independent if \[P(A \cap B) = P(A)P(B)\].

In your own words:

  • What does it mean for two events to be independent of one another?
  • How do you know if two events are independent of one another?
  • How do you test if two events are independent of one another?

Try using this idea of independent in two places:

  1. Suppose that you are creating a model to predict an outcome. Further, suppose that two events \(A\) and \(B\) are independent of one another. Can you use \(B\) to predict \(A\)?
  2. If two events, \(A\) and \(B\) are independent, then what happens if you work through a statement of conditional probability, \(P(A|B)\)?

A practice problem

The last task for us to complete today is working through a practice problem on the course practice problem website. Please, click the link below, and follow us over to the the course’s practice problem website.

link here

Student Tasks to Complete

Before next live session, please complete the homework that builds on this unit. There are two parts, an applied and a proof part. You can submit these homework as many times as you like before the due date (you will not receive feedback), and you can access this homework through bCourses.

The applied homework will be marked either Correct or Incorrect without partial credit applied. These are meant to be problems that you solve, and that have a single straightforward solution concept. The proof homework will be marked for partial credit (out of three points) that evaluates your argument for your solution concept.

Defining Random Variables

yosemite valley

Learning Objectives

At the end of this week’s course of study (which includes the async, sync, and homework) students should be able to

  1. Remember that random variable are neither random, or variables, but instead that they are a foundational object that we can use to reason about a world.
  2. Understand that the intuition developed by the use of set-theory probability maps into the more expressive space of random variables
  3. Apply the appropriate mathematical transformations to move between joint, marginal, and conditional distributions.

From the axioms of probability, it is possible to build a whole, expressive modeling system (that need not be grounded at all in the minutia of the world). With this probability model in place, we can describe how frequently events in the random variable will occur. When variable are dependent upon each other, we can utilize information that is encoded in this dependence in order to make predictions that are closer to the truth than predictions made without this information.

There is both a beauty and a tragedy when reasoning about random variables: we describe random variables using their joint density function.

The beauty is that by reasoning with such general objects – the definitions that we create, and the theorems that we derive in this section of the course – produce guarantees that hold in every case, no matter the function that stands in for the joint density function. We will compute several examples of specific functions to provide a chance to reason about these objects and how they “work”.

The tragedy is that in the “real world”, the world where we are going to eventually going to train and deploy our models, we are never provided with this joint density function. Perhaps this is the creation myth for probability theory: in a perfect world, we can produce a perfect result. But, in the “fallen” world of data, we will only be able to produce approximations.

Class Announcements

Homework

  1. You should have turned in your first homework. The solution set for this homework is scheduled to be released to you in two days. The solution set contains a full explanation of how we solved the questions posed to you. You can expect that feedback for this homework will be released back to you within seven days.
  2. You can start working on your second homework when we are out of this class.

Study Groups

It is a very good idea for you to create a recurring time to work with a set of your classmates. Working together will help you solve questions more effectively, quickly, and will also help you to learn how to communicate what you do and do not understand about a problem to a group of collaborating data scientists. And, working together with a group will help you to find people who share data science interests with you.

Course Resources

There are several resources to support your learning. A learning object last week was that you would be introduced to each of these systems. Please continue to make sure that you have access to the:

Using Definitions of Random Variables

Random Varaible

What is a random variable? Does this definition help you?

A random variable is a function \(X : \Omega \rightarrow \mathbb{R},\) such that \(\forall r \in \mathbb{R}, \{\omega \in \Omega\}: X(\omega) \leq r\} \in S\).

Someone, please, read that without using a single “omega”, \(\mathbb{R}\), or other jargon terminology. Instead, someone read this aloud and tell us what each of the concepts mean.

The goal of writing with math symbols like this is to be absolutely clear what concepts the author does and does not mean to invoke when they write a definition or a theorem. In a very real sense, this is a language that has specific meaning attached to specific symbols; there is a correspondence between the mathematical language and each of our home languages, but exactly what the relationship is needs to be defined into each student’s home language.

  • What are the key things that random variables allow you to accomplish?
    • Suppose that you were going to try to make a model that predicts the probability of winning “big money” on the machine. Big money might be that you get :cherries: :cherries: :cherries:. Can you do math with :cherries:?

Why do we say that random variables are functions? Is there some useful property of these being functions rather than any other quantity?

Functions of Functions

What about a function of a random variable, which is a function of a function.

Let \(g : U \rightarrow \mathbb{R}\) be some function, where \(X(\Omega) \subset U \subset \mathbb{R}\). Then, if \(g \circ X : \Omega \rightarrow \mathbb{R}\) is a random variable, we say that \(g\) is a function of X and write \(g(X)\) to denote the random variable \(g \circ X\).

If a random variable is a function from the real world, or the sample space, or the outcome space to a real number, then what does it mean to define a function of a random variable?

  • At what point does this function work? Does this function change the sample space that is possible to observe? Or, does this function change the real-number that each outcome points to?

Suppose that you are doing some image processing work. To keep things simple, that you are doing image classification in the style of the MNIST dataset.

  • Can someone describe what this task is trying to accomplish?
  • Has anyone done work like this?

However, suppose that rather than having good clean indicators for whether a pixel is on or off, instead you have weak indicators – there’s a lot of grey. A lot of the cells are marked in the range \(0.2 - 0.3\).

  1. How might creating a function that re-maps this grey into more extreme values help your model?
  2. Is it possible to “blur” events that are in the outcome space? Does this “blurring” meet the requirements of a function of a random variable, as provided above?

Pieces of a Random Variable

When you look at the definition of a random variable, there are two key pieces that must exist for every random variable. What are these pieces? As a hint, we actually developed one of these key pieces last week when we were discussing elementary probability.

A random variable is a function \(X : \Omega \rightarrow \mathbb{R},\) such that \(\forall r \in \mathbb{R}, \{\omega \in \Omega\}: X(\omega) \leq r\} \in S\).

Probability Density Functions and Cumulative Density Functions

  • What is a probability mass function?
  • What do the Kolmogorov Axioms mean must be true about any probability mass function (pmf)?

You should try driving in Berkeley some time. It is a trip! Without being deliberately ageist, the city is full of ageing hippies driving Subaru and making what seem to be stochastic right-or-left turns to buy incense, pottery, or just sourdough bread.

Suppose that you are walking to campus, and you have to cross 10 crosswalks, each of which are spaced a block apart. Further, suppose that as you get closer to campus, there are fewer aging hippies, and therefore, there is decreasing risk that you’re hit by a Subaru as you cross the street. Specifically, and fortunately for our math, the risk of being hit decreases linearly with each block that you cross.

Finally, campus provides you with the safety reports from last year, and reports that there were 120 student-Subaru incidents last year, out of 10,000 student-crosswalk crossings.

  1. What is the pmf for the probability that you are involved in a student-Subaru incident as you walk across these 10 blocks? What sample space, \(\Omega\) is appropriate to represent this scenario?
  2. Suppose that you don’t leave your house – this is a remote program after all! What is your cumulative probability of being involved in a student-subaru incident?
  3. What is the cumulative probability cmf for the probability that you are involved in a student-Subaru incident?
  4. Suppose that you live three blocks from campus, but your classmate lives five blocks from campus. What is the difference in the cumulative probability?
  5. How would you describe the cumulative probability of being hit as you walk closer to campus? That is, suppose that you start 10 blocks away from campus, and are walking to get closer. Is your cumulative probability of being hit on your way to campus increasing or decreasing as you get closer to campus?
  6. How would you describe the cumulative probability of being hit as you walk further from campus? That is, suppose that you start on campus, and you’re walking to a bar after classes. Is your cumulative probability of being hit on your way away from campus increasing or decreasing as you get further from campus?

Discrete & Continuous Random Variables

What, if anything is fundamentally different between discrete and continuous random variables? As a way of starting the conversation, consider the following cases:

  • Suppose \(X\) is a random variable that describes the time a student spends on w203 homework 1.
    • If you have only granular measurement – i.e. the number of nights spent working on the homework – is this discrete or continuous?
    • If you have the number of hours, is it discrete or continuous?
    • If you have the number of seconds? Or milliseconds?
  • Is it possible that \(P(X = a) = 0\) for every point \(a\)? For example, that \(P(X = 3600) = 0\).
  • Does one of these measures have more information in it than another?
    • How are measurement choices that we make as designers of information capture systems – i.e. the machine processes, human processes, or other processes that we are going to work with as data scientists – reflected in both the amount of information that is gathered, the type of information that is gathered, and the types of random variables that are manifest as a result?

Moving Between PDF and CDF

The book defines pmf and cmf first as a way of developing intuition and a way of reasoning about these concepts. It then moves to defining continuous density functions, which is many ways are easier to work with although they lack the means of reasoning about them intuitively. Continuous distributions are defined in the book, and more generally, in terms of the cdf, which is the cumulative density function. There are technical reasons for this choice of definition, some of which are signed in the footnotes on the page where the book presents it.

More importantly for this course, in Definition 1.2.15 the book defines the relationship between cdf and pdf in the following way:

For a continous random variable \(X\) with CDF \(F\), the probability density function of \(X\) is

\[ f(x) = \left. \frac{d F(u)}{du} \right|_{u=x}, \forall x \in \mathbb{R}. \]

  • How does this definition, which relates pdf and cdf by a means of differentiation and integration, fit with the ideas that we just developed in the context of walking to and from campus?

Suppose that you learn than a particular random variable, \(X\) has the following function that describes its pdf, \(f_{x}(x) = \frac{1}{10}x\). Also, suppose that you know that the smallest value that is possible for this random variable to obtain is 0.

  1. What is the CDF of \(X\)?
  2. What is the maximum possible value that \(x\) can obtain? How did you develop this answer, using the Kolmogorov axioms of probability?
  3. What is the cumulative probability of an outcome up to 0.5?
  4. What is the probability of an outcome between 0.25 and 0.75? Produce an answer to this in two ways:
  5. Using the \(pdf\)
  6. Using the \(cdf\)

Joint Density

Working with a single random variable helps to develop our understanding of how to relate the different features of a pdf and a cdf through differentiation and integration. However, there’s not really that much else that we can do; and, there is probably very little in our professional worlds that would look like a single random variable in isolation.

We really start to get to something useful when we consider joint density functions. Joint density functions describe the probability that both of two random variables. That is, if we are working with random variables \(X\) and \(Y\), then the joint density function provides a probability statement for \(P(X \cap Y)\).

In this course, we might typically write this joint density function as \(f_{X,Y}(x,y) = f(\cdot)\) where \(f(\cdot)\) is the actual function that represents the joint probability.

Uniform Joint Density

Suppose that we know that two variables, \(X\) and \(Y\) are jointly uniformly distributed within the the support \(x \in [0,4], y \in [0,4]\). We have a requirement, imposed by the Kolmogorov Axioms that all probabilities must be non-zero, and that the total probability across the whole support must be one.

  • Can you use these facts to determine answers to the following:
    • What kind of shape does this joint pdf have?
    • What is the specific function that describes this shape?
    • If you draw this shape on three axes, and \(X\), and \(Y\), and a \(P(X,Y)\), what does this plot look like?
    • How do you get from the joint density function, to a marginal density function for \(X\)?
    • How do you get form the joint density function, to a marginal denisty function for \(Y\)?
    • How do you get from these marginal density functions of \(X\) and \(Y\) back to the joint density? Is this alwasy possible?
knitr::include_app('http://www.statistics.wtf/PDF_Explorer/')

Saddle Sores

Suppose that you know that two random variables, \(X\) and \(Y\) are jointly distributed with the following pdf:

\[ f_{X,Y}(x,y) = \begin{cases} a * x^{2} * y^{2} & 0 < x < 1, 0 < y < 1 \\ 0 & otherwise \end{cases} \]

  • Can you use these facts to determine the following?
    • What value of \(a\) makes this a valid joint pdf?
    • What is the marginal pdf of \(x\)? That is, what is \(f_{x}(x)\)?
    • What is the conditional pdf of \(X\) given \(Y\)? That is, what is $f_{x|y}(x,y)?
    • Given these facts, would you say that \(X\) and \(Y\) are dependent or independent?

Visualizing Distributions Via Simulation

The Visualization Trick

Here is the true density function for a normal random variable.

Simulate Draws

There’s another way to get an approximate idea of what the distribution looks like. Here’s how we take a single draw from a normal distribution with a specific set of features:

rnorm(n = 1, mean = 2, sd = 1)
## [1] 2.071955

Repeating the Experiment

We want to rerun that experiment 10 times. We take a draw, then rewind time, clear our memory and start over with fresh randomness. To do this in R, an easy way is with the replicate() function. Change the code below so that it repeats the experiment above 10 times, then use hist() to display a plot of the result.

simulation <- replicate(
  n    = 10,      # should you change this line? 
  expr = 1 + .2   # or this line? 
  )

simulation
##  [1] 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2

Better Visualization

Here’s some fancy ggplot code to draw a nice histogram of the result, along with the true density. Remove the first line to make it work with your simulation.

short_simulation <-  c(1,2,3,2,4,2)

true_density <- function(x) {
  dnorm(x = x, mean = 2, sd = 1)
  }

dat_hist <- data.frame(short_simulation)

dat_hist %>% 
  ggplot() + 
  geom_histogram(
    aes(x = short_simulation, y = ..density..)) + 
  stat_function(
    aes(x = short_simulation), fun = true_density, color = 'darkred')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Repeating the Experiment: Questions

  • What happens to your plot as you increase the number of draws from 10 to 100 to 1000…?
  • In your own words, what is the difference between the distribution and the sample you are taking?

The Visualization Trick

  • This is a pretty useful trick.
  • The repetition we’re using has no analogue in the real world – we don’t get to shake up the world like a snow globe a number of times in a row to see what it does.
  • But, when we say “take a draw from the distribution” another way to say this is that we’re simulating the random variable.

Apply the Visualization Trick

Part I

  • How can the visualization trick help us? Here’s a problem:

    • Suppose \(X\) and \(Y\) are independent normal random variables, both with mean 2 and standard deviation 1. Say \(Z = X + Y\).
    • What is the distribution of \(Z\)?
  • We could do some math to compute the density function of \(Z\), but it’s actually quite messy. Instead, let’s use the visualization trick to get an approximate idea.

Part II

  • First, write an R function to simulate a single draw from \(Z\).
  1. Simulate a draw for \(X\).
  2. Simulate a draw \(Y\).
  3. Return the sum of the previous draws.
rz <- function() {
  2 # Replace with your code
}
  • Use the previous code to repeat this experiment 10,000 times and plot a histogram.
  • See if you can guess what the the distribution is, and plot your guess on the histogram.

How is simulation useful?

  • Are there situations that you think simulation of this sort might be useful?

Computing Different Distributions.

Suppose that random variables \(X\) and \(Y\) are jointly continuous, with joint density function given by,

\[ f(x,y) = \begin{cases} c, & 0 \leq x \leq 1, 0 \leq y \leq x \\ 0, & otherwise \end{cases} \]

where \(c\) is a constant.

  1. Draw a graph showing the region of the X-Y plane with positive probability density.
  2. What is the constant \(c\)?
  3. Compute the marginal density function for \(X\). (Be sure to write a complete expression)
  4. Compute the conditional density function for \(Y\), conditional on \(X=x\). (Be sure to specify for what values of \(x\) this is defined)

Understanding Joint Distributions

  • In this picture, we imagine putting a cake down on the X-Y plane.
  • Take a sharp knife and make two cuts parallel to the X-axis. one is at \(Y = y\), the other at \(Y = y + dy\).

cake

Review of Terms

Remember some of the key terms we learned in the async:

  • Joint Density Function
  • Conditional Distribution
  • Marginal Distribution

Explain each of these three in terms of the cake metaphor.

Working with a Shiny App

knitr::include_app(url = 'http://statistics.wtf/betahat/', height = '1200px')

Summarizing Distributions

a majestic valley

Learning Objectives

At the end of the live session and homework this week, students will be able to

  1. Understand the importance of thinking in terms of random variables, while;
  2. Appreciating that it is not typically possible to fully model the world with a single function.
  3. Articulate why we need a target for a model, and propose several possible such targets.
  4. Produce summaries of location and relationship given a particular functional form for a random variable.

Class Announcements

Where have we come from, and where are we going?

What is in the rearview mirror?

  • Statisticians create a population model to represent the world; random variables are the building blocks of such a model.
  • We can describe the distribution of a random variable using:
    • A CDF for all random variables
    • A PMF for discrete random variables
    • A PDF for continuous random variables
  • When we have multiple random variables,
    • The joint PMF/PDF describes how they behave together
    • The marginal PMF/PDF describes one variable in isolation
    • The conditonal PMF/PDF describes one variable given the value of another

Today’s Lesson

What might seem frustrating about this probability theory system of reasoning is that we are building a castle in the sky – a fiction. We’re supposing that there is some function that describes the probability that values are generated. In reality, there is no such generative function; it is extremely unlikely (though we’ll acknowledge that it is possible) that the physical reality we belive we exist within is just a complex simulation that has been programmed with functions by some unknown designer.

Especially frustrating is that we’re supposing this function, and then we’re further saying,

“If only we had this impossible function; and if only we also had the ability to take an impossible derivative of this impossible function, then we could…”

Single number summaries of a single random variable

But, here’s the upshot!

What we are doing today is laying the baseline for models that we will introduce next week. Here, we are going to suggest that there are radical simplifications that we can produce that hold specific guarantees, no matter how complex the function that we’re reasoning about.

In particular, in one specific usage of the term best we will prove that the Expectation operation is the best one-number summary of any distribution. To do so, we will define a term, variance, which is the squared deviations from the expectation of a variable that describes how “spread out” is a variable. Then, we will define a concept that is the mean squared error that is the square of the distance between a model prediction and a random variable’s realization. The key realization is that when the model predicts the expectation, then the MSE is equal to the variance of the random variable, which is the smallest possible value it could realize.

Single number summaries of relationships between random variables

Although the single number summaries are incredibly powerful, that’s not enough for today’s lesson! We’re also going to suggest that we can create a measure of linear dependence between two variables that we call the “covariance”, and a related, rescaled version of this relationship that is called the correlation.

Future Attractions

  • A predictor is a function that provides a value for one variable, given values of some others.
  • Using our summary tools, we will define a predictor’s error and then minimize it.
  • This is a basis for linear regression

Discussion of Terms

Together with your instructor, you will talk about what each of the following definitions mean in your own words, for the key concepts, you might also formalize this intuition into a formula that can be computed.

  • Expected Value, or Expectation
  • Central Moments \(\rightarrow\) Variance \(\rightarrow\) Standard Deviation
  • Set aside for later: Chebyshev’s Inequality and the Normal Distribution
  • Mean Squared Error and its alternative formula
  • Covariance and Correlation

Computing Examples

Expected Value of a Die [discrete random variable]

  • The expected value (or population mean) of a discrete random variable \(X\) is the weighted average of the values in the range of \(X\).
  • Suppose that \(X\) represents the number of years of education that someone has completed, and so has a support that ranges from \(0\) years of education, up to \(28\) years of education. (Incidentally, Mark Labovitz has about 28 years of education).

without using specific numbers, describe the process you would use to calculate the expected value of this distribution.

Using a formula

How does the following formula match with your concept of expectation from that last section?

\[ E(X) = \sum_{x \in \{years\}} x \cdot P(X=x) \]

Computing by Hand

Compute the Expected Value

Let \(X\) represent the result of one roll of a 6 sided die.

  • Calculating by hand, what is the expected value \(X\)?

Fill this in by hand.

Compute the Variance

Let \(X\) represent the result of one roll of a 6 sided die.

  • Calculating by hand, what is the variance of \(X\)?

Expected Value by Code

Expected Value of a Six-Sided Die

Let \(X\) represent the result of one roll of a 6 sided die.

  • Build an object to represent the whole sample space, \(\Omega\) of a six sided die.
  • Determine what probabilities to assign to each value of that object.
  • Write the code to run the expectation algorithm that you just performed by hand.
die <- data.frame(
    value = 'fill this in',
    prob  = 'fill this in'
  )

Variacne of a Six-Sided Die

Let \(X\) represent the result of one roll of a 6 sided die.

  1. Using what you know about the definition of variance, write a function that will compute the variance of your die object.
variance_function <- function(die) { 
  ## fill this in
  mu = 'fill this in'
  var = 'fill this in'
  
  return(var)
}

variance_function(die)
## [1] "fill this in"

Suppose that you had to keep the values the same on the die, that is. That is the domain of the outcome still had to be the countable set of integers from one to six.

  • How would you change the probability distribution to decrease the variance of this random variable?
  • What is the smallest value that you can generate for this random variable? Use the variance_function from above to actually compute this variance.
  • What is the largest value of variance that you can generate for this random variable? Use the variance_function from above to actually compute this variance.

Now suppose that you again had an equal probability of every outcome, but you were to apply a function to the number of spots that are showing on the die. Rather that each dot contributing one value to the random variable, instead the random variable’s outcome is the square of the number of spots.

  • How would this change the mean?
  • How would this change the variance?

Practice Computing

Single Variable

Suppose that \(X\) has the following density function:

\[ f_{X}(x) = \begin{cases} 6x(1 - x), & 0 < x < 1 \\ 0, & otherwise \\ \end{cases} \]

  • Find \(E[X]\).
  • Find \(E[X^2]\).
  • Find \(V[X]\).

Joint Density

Suppose that \(X\) and \(Y\) have joint density \(f_{X,Y}(x,y) = 10 x^2y\) for $0 < y < x < 1. $

  • Find \(E[X]\)
  • Find \(E[EY]\)
  • Find \(\rho[X,Y]\)

Conditional Expectation and The BLP

mt. tamalpais

One of our most fundamental goals as data scientists is to produce predictions that are good. In this week’s async, we make a statement of performance that we can use to evaluate how good a job a predictor is doing, choosing Mean Squared Error.

With the goal of minimizing MSE, then we then present, justify, and prove that the conditional expectation function (the CEF) is the globally best possible predictor. This is an incredible powerful result, and one that serves as the backstop for every other predictor that you will ever fit, whether that predictor is a “simple” regression, or that predictor is a machine learning algorithms (e.g. a random forest) or a deep learning algorithm. Read that again – even the most technologically advanced algorithms cannot possibly perform better than the conditional expectation function.

Why does the CEF do so well? Because it can contain a vast amount of complex information and relationships; in fact, the complexity of the CEF is a product of the complexity of the underlying probability space. If that is the case, then why don’t we just use the CEF as our predictor every time?

Well, this is one of the core problems of applied data science work: we are never given the function that describes the behavior of the random variable. And so, we’re left in a world where we are forced to produce predictions from simplifications of the CEF. A very strong simplification, but one that is useful for our puny human brains, is to restrict ourselves to predictors that make predictions from a linear combination of input variables.

Why should we make such a strong restriction? After all, the conditional expectation function might be a fantastically complex combination of input features, why should we entertain functions that are only linear combinations? Essentially, this is because we’re limited in our ability to reason about anything more complex than a linear combination.

Thunder Struck

thunder struck

Learning Objectives

At the end of this weeks learning, which includes the asynchronous lectures, reading the textbook, this live session, and the homework associated with the concepts, student should be able to

  1. Recognize that the conditional expectation function, the CEF, is a the pure-form, best-possible predictor of a target variable given information about other variables.
  2. Recall that all other predictors, be they linear predictors, non-linear predictors, branching predictors, or deep learning predictors, are an attempt to approximate the CEF.
  3. Produce the conditional expectation function as a predictor, given joint densities of random variables.
  4. Appreciate that the best linear predictor, which is a restriction of predictors to include only those that are linear combinations of variables, can produce reasonable predictions, and anticipate that the BLP forms the target of inquiry for regression.

Class Announcements

Test 1 is releasing to you today.

The first test is releasing today. There are review sessions scheduled for this week, practice tests available, and practice problems available. The format for the test is posted in the course discussion channel. In addition to your test, your instructor will describe your responsibilities that are due next week.

Roadmap

Rearview Mirror

  • Statisticians create a population model to represent the world.
  • \(E[X], V[X], Cov[X,Y]\) are “simple” summaries of complex joint distributions, which are hooks for our analyses.
  • They also have useful properties – for example, \(E[X + Y] = E[X] + E[Y]\).

This week

  • We look at situations with one or more “input” random variables, and one “output.”
  • Conditional expectation summarizes the output, given values for the inputs.
  • The conditional expectation function (CEF) is a predictor – a function that yields a value for the output, give values for the inputs.
  • The best linear predictor (BLP) summarizes a relationship using a line / linear function.

Coming Attractions

  • OLS regression is a workhorse of modern statistics, causal analysis, etc
    • It is also the basis for many other models in classical stats and machine learning
  • The target that OLS estimates is exactly the BLP, which we’re learning about this week.

Conditional Expectation Function (CEF),

Part I

Think back to remember the definition of the expectation of \(Y\):

\[ E[Y] = \int_{-\infty}^\infty y \cdot f_{Y}(y) dy \]

This week, in the async reading and lectures we added a new concept, the conditional expectation of \(Y\) given \(X=x \in \text{Supp}[X]\):

\[ E[Y|X=x] = \int_{-\infty}^\infty y \cdot f_{Y|X}(y|x) dy \]

Part II

  1. What desirable properties of a predictor does the expectation possess (note, this is thinking back by a week)? What makes these properties desirable?
  2. Turning to the content from this week, how, if at all, does the conditional expectation improve on these desirable properties?

Part III

  • Compare and contrast \(E[Y]\) and \(E[Y|X]\). For example, when you look at how these operators are “shaped”, how are their components similar or different?2

  • What is \(E[Y|X]\) a function of? What are “input” variables to this function?

  • What, if anything, is \(E[E[Y|X]]\) a function of?

Computing the CEF

  • Suppose that random variables \(X\) and \(Y\) are jointly continuous, with joint density function given by,

\[ f(x,y) = \begin{cases} 2, & 0 \leq x \leq 1, 0 \leq y \leq x \\ 0, & otherwise \end{cases} \]

What does the joint PDF of this function look like?

To begin with, let’s compute the simplest quantities:

  • What is the expectation of \(X\)?
  • What is the expectation of \(Y\)?
  • How would you compute the variance of \(X\)? (We’re not going to do it live).

And then, let’s think about how to compute the conditional quantites:

  • What is the conditional expectation of \(Y\), given that \(X=x=0\)?
  • What is the conditional expectation of \(Y\), given that \(X=x=0.5\)?
  • What is the conditional expectation of \(X\), given that \(Y=y=0.5\)?
  • Which of the two of these has a lower conditional variances?
    • \(V[Y|X=0.25]\); or,
    • \(V[Y|X=0.75]\).
  • How does \(V[Y]\) compare to \(V[Y|X=1]\)? Which is larger?

Conditional Functionals

Suppose that you were told that \(X\) and \(Y\) are jointly distributed according to the following joint density function:

\[ f_{X,Y}(x,y) = \begin{cases} 2 & 0 \leq x \leq 1, 0 \leq y \leq x \\ 0 & otherwise \end{cases} \]

Conditional Expectation

What is the conditional expectation function? To get started, you can use the fact that in week two, we already computed the conditional probability density function:

\[ f_{Y|X}(y|x) = \begin{cases} \frac{1}{x}, & 0 \leq y \leq x \\ 0, & \text{otherwise.} \end{cases} \]

 
 
 
 
 
 
 
 
 
 

Conditional Variance

  • What is the conditional variance function?3

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Breakout Activity

Minimizing MSE

Theorem 2.2.20 states,

The CEF \(E[Y|X]\) is the “best” predictor of \(Y\) given \(X\), where “best” means it has the smallest mean squared error (MSE).

Oh yeah? As a breakout group, ride shotgun with us as we prove that the conditional expectation is the function that produces the smallest possible Mean Squared Error.

Specifically, you group’s task is to justify every transition from one line to the next using concepts that we have learned in the course: definitions, theorems, calculus, and algebraic operations.

The pudding (aka: “Where the proof is”)

We need to find such function \(g(X): \mathbb{R} \to \mathbb{R}\) that gives the smallest mean squared error.

First, let MSE be defined as it is in Definition 2.1.22.

For a random variable \(X\) and constant \(c \in \mathbb{R}\), the mean squared error of \(X\) about \(c\) is \(E[(x-c)^2]\).

Second, let us note that since \(g(X)\) is just a function that maps onto \(\mathbb{R}\), that for some particular value of \(X=x\), \(g(X)\) maps onto a constant value.

  • Deriving a Function to Minimize MSE

\[ \begin{aligned} E[(Y - g(X))^2|X] &= E[Y^2 - 2Yg(X) + g^2(X)|X] \\ &= E[Y^2|X] + E[-2Yg(X)|X] + E[g^2(X)|X] \\ &= E[Y^2|X] - 2g(X)E[Y|X] + g^2(X)E[1|X] \\ &= (E[Y^2|X] - E^2[Y|X]) + (E^2[Y|X] - 2g(X)E[Y|X] + g^2(X)) \\ &= V[Y|X] + (E^2[Y|X] - 2g(X)E[Y|X] + g^2(X)) \\ &= V[Y|X] + (E[Y|X] - g(X))^2 \\ \end{aligned} \]

Notice too that we can use the Law of Iterated Expectations to do something useful. (This is a good point to talk about how this theorem works in your breakout groups.)

\[ \begin{aligned} E[(Y-g(X))^2] &= E\big[E[(Y-g(X))^2|X]\big] \\ &=E\big[V[Y|X]+(E[Y|X]-g(X))^2\big] \\ &=E\big[V[Y|X]\big]+E\big[(E[Y|X]-g(X))^2\big]\\ \end{aligned} \]

  • \(E[V[Y|X]]\) doesn’t depend on \(g\); and,
  • \(E[(E[Y|X]-g(X))^2] \geq 0\).

\(\therefore g(X) = E[Y|X]\) gives the smallest \(E[(Y-g(X))^2]\)

The Implication

If you are choosing some \(g\), you can’t do better than \(g(x) = E[Y|X=x]\).

Working with the BLP

Why Linear?

  • In some cases, we might try to estimate the CEF. More commonly, however, we work with linear predictors. Why?

  • We don’t know joint density function of \(Y\). So, it is “difficult” to derive a suitable CEF.

  • To estimate flexible functions requires considerably more data. Assumptions about distribution (e.g. a linear form) allow you to leverage those assumptions to learn ‘more’ from the same amount of data.

  • Other times, the CEF, even if we could produce an estimate, might be so complex that it isn’t useful or would be difficult to work with.

  • And, many times, linear predictors (which might seem trivially simple) actually do a very good job of producing predictions that are ‘close’ or useful.

##Joint Distribution Practice

Professorial Mistakes (Discrete RVs)

  • Let the number of questions that students ask be a RV, \(X\).

  • Let \(X\) take on values: \(\{1, 2, 3\}\), each with probability \(1/3\).

  • Every time a student asks a question, the instructor answers incorrectly with probability \(1/4\), independently of other questions.

  • Let the RV \(Y\) be number of incorrect responses.

  • Questions:

    • Compute the expectation of \(Y\), conditional on \(X\), \(E[Y|X]\)
    • Using the law of iterated expectations, compute \(E[Y] = E\big[E[Y|X]\big]\).

Learning from Random Samples

south hall

Goals, Framework, and Learning Objectives

Class Announcements

  • You’re done with probability theory. Congrats!
  • You’re also done with your first test. Congrats!
  • We’re going to have a second test in a few weeks; then we’re done testing for the semester

Learning Objectives

At the end of this week, students will be able to

  1. Understand what iid sampling is, and evaluate whether the assumption of iid sampling is sufficiently plausible to engage in frequentist modeling.
  2. Appreciate that with iid sampling, summarizing functions of random variables are, themselves, random variables with probability distributions and values that they obtain.
  3. Recall the definition of an estimator,
  4. Recall definition of an estimator, state and understand the desirable properties of estimators, and evaluate whether an estimator possesses those desirable properties.
  5. Distinguish between the concepts of {expectation & sample mean}, {variance & unbiased sample variance estimator, sampling-based variance in the sample mean}.

Roadmap

Where We’re Going – Coming Attractions

  • We’re going to start bringing data into our work
  • First, we’re going to develop a testing framework that is built on sampling theory and reference distributions – t.tests, wilcox.test and the like
  • Second, we’re going to show that OLS regression is the sample estimator of the BLP
  • Third, we’re going to use the testing distribution to test regression coefficients

Where We’ve Been – Random Variables and Probability Theory

  • Statisticians create a model (A.K.A. population) to represent the world.
  • That model can be described by parameters like expecation, covariance.
  • So far, these parameter values have come from our imaginations

Where we Are

  • We want to fit models – use data to set their parameter values.
  • A sample is a set of random variables
  • Sample statistics are functions of a sample, and they are random variables
  • Under iid and other assumptions, we get useful properties:
    • Statistics may be consistent estimators for population parameters
    • The distribution of sample statistics may be asymptotically normal

Key Terms and Assumptions

IID

For each scenario, is the IID assumption plausible?

  • Call a random phone number. If someone answers, interview all persons in the household. Repeat until you have data on 100 people.
  • Call a random phone number, interview the person if they are over 30. Repeat until you have data on 100 people.
  • Record year-to-date price change for 20 largest car manufacturers.
  • Measure net exports per GDP for all 195 countries recognized by the UN.

Definitions

Define each of the following:

  • Sample
  • Sample Statistic
  • Estimator
  • Bias
  • Efficiency
  • Consistency
  • Convergence in Probability
  • Convergence in Distribution

Understanding Sampling Distributions

  • Let \(X\) be a Bernoulli random variable representing an unfair coin with \(P(X=1) = 0.7\).
  • You have an iid sample of size 2, \((X_1,X_2)\).
  • Compute the sampling distribution of \(\overline X = \frac{X_1+X_2}{2}\).

Questions:

  • Explain the difference between a population distribution and the sampling distribution of a statistic.
  • As we toss more and more coins, \(\overline X_{(100)} \rightarrow \overline X_{(10000)}\) what will the value of \(\overline X\) get closer to? What law generates this, and why does this law generate this result?
  • Why do we want to know things about the sampling distribution of a statistic?

Uncertainty

Which Result is Better?

  • Suppose that you measure salary data among individuals who try different strategies
  • Report out in the following table:
Early Rising Mindfulness Retreat MIDS Degree
Increase in Salary $1020 $5130 $9200
\(SE\) ($350) ($4560)
\(N =\) 1,000 77 700

(Standard errors in parentheses when available)

Standard Errors

Errors in stating standard errors frequently occur. (We wanted to start with what might the most vapid, but also confusing statement possible!)

Standard errors are a statement about the sampling variance of the sample average. But, related to this concept are the ideas of the Population Variance, the Plug-In Estimator for the Sample Variance, the Unbiased Sample Variance, and, finally, the Sampling Variance of the Sample Average (i.e the Standard Error).

How are each of these concepts related to one another, and how can we keep them all straight? As a group, fill out the following columns?

Population Concept Sample Estimator Estimator Properties Sampling Variance of Sample Estimator
Expected Value
Population Variance
Population Covariance
CEF
BLP

Write Code to Demo the Central Limit Theorem (CLT)

Motivating the Central Limit Theorem (CLT)

  • Standard Errors tell us a lot about the uncertainty in our statistics
  • But we want to say more:
    • How confident are we that this vitamin has a positive effect?
    • How plausible is a mean income $1000 below our estimate?
  • For these questions, we need to know the sampling distribution of our statistic.
  • How is this possible when we don’t know the population distribution?

Sampling from the Bernoulli Distribution in R

  • To demonstrate the CLT, we chose a Bernoulli distribution with parameter \(p\).
    • This distribution is very simple
    • This distribution is non-normal, and can be very skewed depending on \(p\).
  • First, set p=0.5 so your population distribution is symmetric. Use a variable n to represent your sample size. Initially, set n=3.
n <- 3
p <- 0.5

Useful R Commands

sample() or rbinom()

  • R doesn’t have a bernoulli function.
  • To simulate draws from a Bernoulli variable, you can either:
    • Use sample
    • Or, use rbinom (the Bernoulli distribution is a special case of a binomial distribution. In this function, size refers to a distribution parameter, not the number of draws.)
sample(x=0:1, size=n, replace=TRUE, prob=c(1-p, p))
rbinom(n=n, size=1, prob=p)
## [1] 0 0 0
## [1] 0 0 0

replicate()

  • To repeat an action, you can use replicate
replicate(10, log(10))
##  [1] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
ggplot() + 
  aes(x = rnorm(100)) + 
  geom_histogram(bins = '8') 

Exercise

Part 1

Throughout this part, we will use fair coins (p = 0.5).

  1. Fill in the function below so that it simulates taking n draws from a Bernoulli distribution with parameter p. This is like tossing n coins at the same time. Use the mean function to compute the sample mean – the average of the number of heads that are showing. Make sure that when you run it, you return values in \(\{0,1/3,2/3,1\}\).
experiment <- function(n, p){

  }
  1. The sample mean is a random variable. To understand it, use the visualization trick from a few weeks ago. Use the replicate function to run the above experiment 1000 times, and plot a histogram of the results.

  2. If you replicate the experiment enough times, will the distribution ever look normal? Why or why not?

  3. Use sd() to check the standard deviation of the sampling distribution of the mean for number_of_coins = 3. What sample size is needed to decrease the standard deviation by a factor of 10? Check that your answer is correct.

Part 2

For this part, we’ll continue to study a fair coin.

  1. Try different values for the sample size n, and examine the shape of the sampling distribution of the mean. At what point does it look normal to you?

Part 3

For this part, we’ll study a very unfair coin. p = 0.01.

This is an example of a highly skewed random variable. That roughly means that one tail is a lot longer than the other.

For this activity, you can simply use your eyes to gauge how skewed a distribution is. If you prefer, you can also use the skewness command in the univar package to measure skewness. You may hear a rule of thumb that a skewness above 1 or below -1 is a highly skewed distribution.

  1. Start with n=3 as before. What do you notice about the shape of the sampling distribution?

  2. Try different values for the sample size n, and examine the shape of the sampling distribution of the mean. At what point does it look normal to you?

Discussion Questions

  1. How does the skewness of the population distribution affect the applicability of the Central Limit Theorem? What lesson can you take for your practice of statistics?

  2. Name a variable you would be interested in measuring that has a substantially skewed distribution.

  3. One definition of a heavy tailed distribution is one with infinite variance. For example, you can use the rcauchy command in R to take draws from a Cauchy distribution, which has heavy tails. Do you think a “heavy tails” distribution will follow the CLT? What leads you to this intuition?

Hypothesis Testing

Frequentist Hypothesis testing is a very well established framework in the applied practice, and scientific literature. Sometimes (often, currently) referred to as Null Hypothesis Significance Testing, this framework essentially makes an absurd assumption and asks the data to overturn that absurd assumption.

Like a petulant child, NHST essentially proclaims,

“Oh yeah? Well, if you actually loved me then you would buy me an iPad!”

Which is to say, it says, “Oh yeah? If this absurd thing isn’t true, prove it to me!”

update!

What is Frequentist testing doing?

This testing framework works on samples of data, and applies estimators to produce estimates of population parameters that are fundamentally unknown and unknowable. Despite this unknown and unknowable population target, with some carefully written down estimators we can rely on the convergence characteristics of some estimators to produce useful, reliable results.

We begin with the one-sample t-test. The one-sample t-test relies on the sample average as an estimator of a population expectation. In doing so, it relies on the effectiveness of the Weak Law of Large Numbers and the Central Limit Theorem to guarantee that the estimator that converges in probability to the population expectation, while also converging in distribution to a Gaussian distribution.

These two convergences permit a data scientist to make several inferences based on data:

  1. The probability of generating data that “looks like what is observed”, if the null-hypothesis were true. This is often referred to as the p-value of a test, and is the petulant statement identified above.
  2. An interval of values that, with some stated probability (e.g. 95%), contains the true population parameter.

This framework begins a exceedingly important task that we must understand, and undertake when we are working as data scientists: Producing our best-estimate, communicating how we arrived at that estimate, what (if any) guarantees that estimate provides, and crucially all limitations of our estimate.

Learning Objectives

  1. Understand the connection between random variables, sampling, and statistical tests.
  2. Apply the Frequentist testing framework in a simple test – the one-sample t-test.
  3. Anticipate that every additional Frequentist test is a closely related variant of this test.

Class Announcements

  1. You will be taking your second test this week. The test follows the same format as the first test, which we will discuss in live session. The test will cover Unit 4: Conditional Expectation and the Best Linear Predictor and Unit 5: Learning from Random Samples.
  • Like the first test, our goal is to communicate to you what concepts we think are important, and then to test those concepts directly, and fairly. The purpose of the test is to give you an incentive to review what you have learned through probability theory, and then to demonstrate that you can produce work based on that knowledge.
  • There is another practice test on Gradescope, and in the GitHub repository.
  1. In rosier news, we’re moving out of the only pencil and paper section of this course, and bringing what we have learned out into the dirty world of data. This means a few things:
  • If you haven’t yet worked through the R Bridge Course that is available to you, working on this bridge course will be useful for you (after you complete your test). The goal of the course is to get you up and running with reasonably successful code and workflows for the data-based portion of the course.
  1. We will assign teams, and begin our work on Lab 1 in Live Session next week. This is a two-week, group lab that you will work on with three total team-mates. The lab will cover some of the fundamentals of hypothesis tests,

Roadmap

Looking Backwards

  • Statisticians create a model to represent the world
  • We saw examples of estimators, which approximate model parameters we’re interested in.
  • By itself, an estimate isn’t much good; we need to capture the uncertainty in the estimate.
  • We’ve seen two ways to express uncertainty in an estimator: standard errors and confidence intervals.

Today

  • We introduce hypothesis testing
    • A hypothesis test also captures uncertainty, but in relation to a specific hypothesis.

Looking Ahead

  • We’ll build on the one-sample t-test, to introduce several other statistical tests.
  • We’ll see how to choose a test from different alternatives, with an eye on meeting the required assumptions, and maximizing power.

What does a hypothesis test do?

  • What are the two possible outcomes of a hypothesis test?
  • What are the four-possible combinations of (a) hypothesis test rest; and (b) state of the world?
  • Does a hypothesis test always have to report a result that is consistent with the state of the world? What does it mean if it does, and what does it mean if it does not.
  • What if you made up your own testing framework, called the {Your Last Name’s} Groundhog test. Which is literally a groundhog looking to see its shadow. Because you made the test, suppose that you know that it is totally random whether a groundhog sees its shadow. How useful would this test be at separating states of the world?
  • What guarantee do you get if you follow the decision rules properly?
  • Why do we standardize the mean to create a test statistic?

\[ t = \frac{ \overline{X}_n - \mu}{\sqrt{\frac{s^2}{n}}} \]

Madlib prompt

tone_of_voice     <- 'yell'
mode_of_speech    <- 'call'
superlative       <- 'worst'
score_on_test     <- '70'
name_of_classmate <- 'victor'
emotion           <- 'ennui'
eating_verb       <- 'smack' #slurp
vessel            <- 'bowl'
thing_found_in_compost <- 'bannana peel'

Madlib completed

Suppose that a classmate comes to you, and, in a yell voice call, “Hey, I’ve got something that is worst for statistics. All you’ve got to do to get 70 on Test 2 and make victor ennui is smack this bowl of bannana peel.

You’re skeptical, but also curious because that last test was tough. Good.

Acknowledging that we’re only 40% of the way through the course, you decide to hire a hungry, underpaid PhD student the School to conduct the experiment to evaluate this claim. They report back, no details about the test, but they do tell you, “We’re sure there’s no effect of bannana peel.”

  • Do you believe them?
  • What, if any, reasons can you imagine not to believe this conclusion?

All joking aside

Explain this joke:

har har har

Manual Computation of a t-Test

In a warehouse full of power packs labeled as 12 volts we randomly measured the voltage of 7. Here is the data:

voltage <- c(11.77, 11.90, 11.64, 11.84, 12.13, 11.99,  11.77)
  1. Find the mean and the standard deviation.

  2. Using qt(), compute the t critical value for a hypothesis test for this sample.

  3. Define a test statistic, \(t\), for testing whether the population mean is 12.

  4. Calculate the p-value using the t statistic.

  5. Should you reject the null? Argue this in two different ways. (Following convention, set \(\alpha = .05\).)

  6. Suppose you were to use a normal distribution instead of a t-distribution to test your hypothesis. What would your p-value be for the z-test?

  7. Without actually computing it, say whether a 95% confidence interval for the mean would include 12 volts.

  8. Compute a 95% confidence interval for the mean.

Data Exercise

t-Test Micro Cheat Sheet

  • Key t-Test Assumptions
    • Metric variable
    • IID
    • No major deviations from normality, considering sample size

Testing the Home Team Advantage

The file athlet2.Rdata contains data on college football games. The data is provided by Wooldridge and was collected by Paul Anderson, an MSU economics major, for a term project. Football records and scores are from 1993 football season.

load("data/athlet2.RData")
data
##    dscore dinstt doutstt htpriv vtpriv dapps htwrd vtwrd dwinrec dpriv
## 1      10   -409   -4679      0      0 -1038     1     1       0     0
## 2     -14     NA     -66      0      0 -7051     1     1       0     0
## 3      23   -654    -637      0      0  6209     1     0       1     0
## 4       8   -222     456      0      0  -129     1     1       0     0
## 5     -12    -10     208      0      0   794     1     1       0     0
## 6       7    494      17      0      0   411     0     0       0     0
## 7     -21      2       2      0      0 -4363     1     1       0     0
## 8      -5     96    -333      0      0  1144     1     0       1     0
## 9      -3    223    2526      0      0  3956     0     0       0     0
## 10    -32    -20       0      0      0  -641     0     1      -1     0
## 11      9     66       0      0      0  -278     1     0       1     0
## 12      1     56    -346      0      0 -2223     1     0       1     0
## 13      7    556     717      0      0 -5217     1     0       1     0
## 14    -20    169    -461      0      0  1772     0     1      -1     0
## 15     35   -135     396      0      0    85     1     0       1     0
## 16     35    -40       0      0      0  -988     1     0       1     0
## 17    -25     24       0      0      0 -8140     1     1       0     0
## 18     -9     90       0      0      0  8418     0     1      -1     0
## 19    -33     27     900      0      0 -3273     0     0       0     0
## 20      7    -89     -31      0      0  1906     1     0       1     0
## 21     -3    536    2352      0      0  -151     1     1       0     0
## 22     -6  13261    9111      1      0 -9936     1     1       0     1
## 23    -29  13809   10076      1      0 -6265     0     1      -1     1
## 24     14 -17631  -10589      0      1  1252     1     0       1    -1
## 25    -18  14885    9983      1      0 -4529     1     1       0     1
## 26     48 -15220  -11400      0      1  -318     1     0       1    -1
## 27     -3     99     -29      0      0  -797     0     1      -1     0
## 28     -3    -54     -88      0      0  -372     0     1      -1     0
## 29     -3    -98   -4175      1      0  2460     1     1       0     1
## 30      2   -304    2987      0      1 -3035     1     1       0    -1

We are especially interested in the variable, dscore, which represents the score differential, home team score - visiting team score. We would like to test whether a home team really has an advantage over the visiting team.

  1. The instructor will assign you to one of two teams. Team 1 will argue that the t-test is appropriate to this scenario. Team 2 will argue that the t-test is invalid. Take a few minutes to examine the data, then formulate your best argument.

  2. Should you perform a one-tailed test or a two-tailed test? What is the strongest argument for your answer?

  3. Execute the t-test and interpret every component of the output.

  4. Based on your output, suggest a different hypothesis that would have led to a different test result. Try executing the test to confirm that you are correct.

Assumptions Behind the t-test

For the following scenarios, what is the strongest argument against the validity of a t-test?

  • You have a sample of 50 CEO salaries, and you want to know whether the mean salary is greater than $1 million.

  • A nonprofit organization measures the percentage of students that pass an 8th grade reading test in 40 neighboring California counties. You are interested in whether the percentage of students that pass in California is over 80%

  • You have survey data in which respondents assess their own opinion of corgis, with options ranging from “1 - extreme disgust” to “5 - affection so intense it threatens my career.” You want to know whether people on the average like corgis more than 3, representing neutrality.

Comparing Two Groups

Learning Objectives

This week, we introduce the idea of comparing two groups to evaluate whether the data that we have sampled lead us to believe that the population distribution of the random variables are different. Of course, because we don’t get access to the function that describes the random variable, we can’t actually know if the populations are different. It is for this reason that we call it statistical inference – we are inferring from a sample some belief about the populations.

At the conclusion of this week, students will be able to:

  1. Recognize the similarities between all frequentist hypothesis tests.
  2. Evaluate the conditions that surround the data, and choose a test that is best-powered and justifiable.
  3. Perform and Interpret the results of the most common statistical tests.

Class Announcements

  • Great work completing your final w203 test!
  • There is no unit 7 homework!
  • The Hypothesis Testing Lab is released today!
    • Lab is due at Unit 09 Live Session (two weeks)
    • Group lab in two parts:
      • Part 1: Work as a team to engage the fundamentals of hypothesis tests
      • Part 2: Apply these fundamentals to analyze 2020 election data and write a single, three-page analysis

Roadmap

Rearview Mirror

  • Statisticians create a population model to represent the world
  • A population model has parameters we are interested in
    • Ex: A parameter might represent the effect that a vitamin has on test performance
  • A null hypothesis is a specific statement about a parameter
    • Ex: The vitamin has zero effect on performance
  • A hypothesis test is a procedure for rejecting or not rejecting a null, such the probability of a type 1 error is constrained.

Today

  • There are often multiple hypothesis tests you can apply to a scenario.
  • Our primary concern is choosing a test with assumptions we can defend.
  • Secondarily, we want to maximize power.

Looking ahead

  • Next week, we start working with models for linear regression
  • We will see how hypothesis testing is also used for regression parameters.

Teamwork Discussion

Working on Data Science Teams

Data science is a beautiful combination of team-work and individual-work. It provides the opportunity to work together on a data pipeline with people from all over the organization, to deal with technical, statistical, and social questions that are always interesting. While we expect that everyone on a team will be a professional, there is so much range within the pursuit of data science that projects are nearly always collaborative exercises.

Together as teams, we

  • Define research ambitions and scope
  • Imagine/envision the landscape of what is possible
  • Support, discuss, review and integrate individual contributions

Together as individuals we conduct the heads-down work that moves question answering forward. This might be reading papers to determine the most appropriate method to bring to bear on the question, or researching the data that is available, or understanding the technical requirements that we have to meet for this answer to be useful to our organization in real time.

What is your instructor uniquely capable of? Let them tell you! But, at the same time, what would they acknowledge that others are better than them?

See, the thing is, because there is so much that has to be done, there literally are very, very few people who are one-stop data science shops. Instead, teams rely on collaboration and joint expertise in order to get good work done.

The Problematic Psychology of Data Science

People talk about the impostor syndrome: a feeling of inadequacy or interloping that is sometimes also associated with a fear of under-performing relative to the expectation of others on the team. These emotions are common through data science, academics, computer science. But, these types of emotions are also commonplace in journalism, film-making, and public speaking.

Has anybody ever had the dream that they’re late to a test? Or, that that they’ve got to give a speech that they’re unprepared for? Does anybody remember playing an instrument as a kid and having to go to recitals? Or, play for a championship on a youth sports team? Or, go into a test two?

What are the feelings associated with those events? What might be generating these feelings?

What Makes an Effective Team?

What really matters to creating an effective tema is less about who is on the team, and more about how the team works together.

In your live session, your section might take 7 minutes to read this brief. If so, please read the following sections:

  • The problem statement;
  • The proposed solution;
  • The framework for team effectiveness, stopping at the section titled “Tool: Help teams determine their own needs.”

“Psychological safety refers to an individual’s perception of the consequences of taking an interpersonal risk. It is a belief that a team is safe for risk taking in the face of being seen as ignorant, incompetent, negative, or disruptive.”

“In a team with high psychological safety, teammates feel safe to take risks around their team members. They feel confident that no one on the team will embarrass or punish anyone else for admitting a mistake, asking a question, or offering a new idea.”

We All Belong

  • From your experience, can you give an example of taking a personal risk as part of a team?
    • Can you describe your emotions when contemplating this risk?
    • If you did take the risk, how did the reactions of your teammates affect you?
  • Knowing the circumstances that generate feelings of anxiety – what steps can we take as a section, or a team, to recognize and respond to these circumstances?

How can you add to the psychological safety of your peers in the section and lab teammates?

Team Kick-Off

Lab 1 Teams

  • Here are teams for Lab 1!

Team Kick-Off Conversation

  • In a 10 minute breakout with your team, please discuss the following questions:
  1. How much time will you invest in the lab each week?
  2. How often will you meet and for how long?
  3. How will you discuss, review, and integrate individual work into the team deliverable?
  4. What do you see as the biggest risks when working on a team? How can you contribute to an effective team dynamic?

A Quick Review

Review of Key Terms

  • Define each of the following:
    • Population Parameter
    • Null Hypothesis
    • Test Statistic
    • Null Distribution

Comparing Groups Review

Take a moment to recall the tests you learned this week. Here is a quick cheat-sheet to their key assumptions.

paired/unpaired parametric non-parametric
unpaired unpaired t-test
- metric var
- i.i.d.
- (not too un-)normal
Wilcoxon rank-sum
ordinal var
i.i.d.
paired paired t-test
metric var
i.i.d.
(not too un-)normal
Wilcoxon signed-rank
metric var
i.i.d.
difference is symmetric

sign test
ordinal var
i.i.d.

Comparing Groups R Exercise

The General Social Survey (GSS) is one of the longest running and extensive survey projects in the US. The full dataset includes over 1000 variables spanning demographics, attitudes, and behaviors. The file GSS_w203.RData contains a small selection of a variables from the 2018 GSS.

To learn about each variable, you can enter it into the search bar at the GSS data explorer

load('data/GSS_w203.RData')
summary(GSS)
##            rincome               happy                           sexnow        wwwhr       
##  $25000 or more: 851   very happy   : 701   women                   :758   Min.   :  0.00  
##  $20000 - 24999: 107   pretty happy :1307   man                     :640   1st Qu.:  3.00  
##  $10000 - 14999:  94   not too happy: 336   transgender             :  2   Median :  8.00  
##  $15000 - 19999:  61   DK           :   0   a gender not listed here:  1   Mean   : 13.91  
##  lt $1000      :  33   IAP          :   0   Don't know              :  0   3rd Qu.: 20.00  
##  (Other)       : 169   NA           :   0   (Other)                 :  0   Max.   :140.00  
##  NA's          :1033   NA's         :   4   NA's                    :947   NA's   :986     
##     emailhr                     socrel                socommun      numpets      
##  Min.   :  0.000   sev times a week:382   never           :510   Min.   : 0.000  
##  1st Qu.:  0.000   sev times a mnth:287   once a month    :243   1st Qu.: 0.000  
##  Median :  2.000   once a month    :259   sev times a week:219   Median : 1.000  
##  Mean   :  7.152   sev times a year:240   sev times a year:196   Mean   : 1.718  
##  3rd Qu.: 10.000   almost daily    :217   sev times a mnth:174   3rd Qu.: 2.000  
##  Max.   :100.000   (Other)         :171   (Other)         :215   Max.   :20.000  
##  NA's   :929       NA's            :792   NA's            :791   NA's   :1201    
##     tvhours                           major1         owngun   
##  Min.   : 0.000   business administration: 138   yes    :537  
##  1st Qu.: 1.000   education              :  79   no     :993  
##  Median : 2.000   engineering            :  54   refused: 39  
##  Mean   : 2.938   nursing                :  51   DK     :  0  
##  3rd Qu.: 4.000   health                 :  42   IAP    :  0  
##  Max.   :24.000   (Other)                : 546   NA     :  0  
##  NA's   :793      NA's                   :1438   NA's   :779

You have a set of questions that you would like to answer with a statistical test. For each question:

  1. Choose the most appropriate test.
  2. List and evaluate the assumptions for your test.
  3. Conduct your test.
  4. Discuss statistical and practical significance.

The Questions

Set 1

  • Do economics majors watch more or less TV than computer science majors?
library(tidyverse)
library(magrittr)
library(ggthemes)

GSS %>% 
  filter(major1 %in% c('computer science', 'economics')) %$% 
  t.test(tvhours ~ major1)
## 
##  Welch Two Sample t-test
## 
## data:  tvhours by major1
## t = 1.7123, df = 11.663, p-value = 0.1133
## alternative hypothesis: true difference in means between group computer science and group economics is not equal to 0
## 95 percent confidence interval:
##  -0.3134265  2.5800931
## sample estimates:
## mean in group computer science        mean in group economics 
##                       2.600000                       1.466667
GSS %>% 
  filter(major1 %in% c('computer science', 'economics')) %$% 
  wilcox.test(tvhours ~ major1)
## Warning in wilcox.test.default(x = c(1L, 2L, 2L, 1L, 7L, 5L, 1L, 3L, 2L, : cannot compute exact
## p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  tvhours by major1
## W = 101, p-value = 0.1372
## alternative hypothesis: true location shift is not equal to 0
GSS %>% 
  filter(major1 %in% c('computer science', 'economics')) %>% 
  mutate(tv_category = case_when(
    tvhours < 2 ~ 1,
    tvhours >= 2 & tvhours < 5 ~ 2,
    tvhours >= 5 ~ 3)) %$%
  # t.test(tv_category ~ major1)
  wilcox.test(tv_category ~ major1)
## Warning in wilcox.test.default(x = c(1, 2, 2, 1, 3, 3, 1, 2, 2, 2), y = c(1, : cannot compute
## exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  tv_category by major1
## W = 99.5, p-value = 0.1379
## alternative hypothesis: true location shift is not equal to 0
GSS %>% 
  filter(major1 %in% c('computer science', 'economics')) %$% 
  wilcox.test(tvhours ~ major1)
## Warning in wilcox.test.default(x = c(1L, 2L, 2L, 1L, 7L, 5L, 1L, 3L, 2L, : cannot compute exact
## p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  tvhours by major1
## W = 101, p-value = 0.1372
## alternative hypothesis: true location shift is not equal to 0
GSS %>% 
  filter(major1 %in% c('computer science', 'economics')) %>% 
  ggplot() + 
  aes(x=tvhours) + 
  geom_histogram(bins=10)
## Warning: Removed 11 rows containing non-finite values (stat_bin).

  • Do Americans with pets watch more or less TV than Americans without pets?

Set 2

  • Do Americans spend more time emailing or using the web?
GSS %>% 
  select(wwwhr, emailhr) %>% 
  drop_na() %$% 
  t.test(x=wwwhr, y=emailhr, paired=TRUE)
## 
##  Paired t-test
## 
## data:  wwwhr and emailhr
## t = 13.44, df = 1360, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5.530219 7.420553
## sample estimates:
## mean of the differences 
##                6.475386
GSS %>% 
  ggplot() + 
  geom_histogram(aes(x=wwwhr), fill = 'blue') + 
  geom_histogram(aes(x=emailhr, fill = 'red'))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 986 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 929 rows containing non-finite values (stat_bin).

t.test(
  x = GSS$wwwhr, 
  y = GSS$emailhr, 
  paired = FALSE
)
## 
##  Welch Two Sample t-test
## 
## data:  GSS$wwwhr and GSS$emailhr
## t = 12.073, df = 2398.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5.657397 7.851614
## sample estimates:
## mean of x mean of y 
## 13.906021  7.151515
  • Do Americans spend more evenings with neighbors or with relatives?

What a great question this is… so glad for the chance to answer it, and to deal with this data that someone else has coded for me. Thank you universe.

wilcox_test_data <- GSS %>% 
  select(socrel, socommun) %>%
  mutate(
    family_ordered = factor(
      x      = socrel, 
      levels = c('almost daily', 'sev times a week', 
                 'sev times a mnth', 'once a month',
                 'sev times a year', 'once a year', 'never')),
    friends_ordered = factor(
      x      = socommun, 
      levels = c('almost daily', 'sev times a week', 
                 'sev times a mnth', 'once a month',
                 'sev times a year', 'once a year', 'never'))) 

To begin this investigation, we’ve got to look at the data and see what is in it. If you look below, you’ll note that it sure seems that people are spending more time with their family… erp, actually no. They’re “hanging out” with their friends rather than taking their mother out to dinner.

wilcox_test_data %>% 
  select(friends_ordered, family_ordered) %>% 
  rename(
    Friends = friends_ordered, 
    Family  = family_ordered
  ) %>% 
  drop_na() %>% 
  pivot_longer(cols = c(Friends, Family)) %>%   
  ggplot() + 
    aes(x=value, fill=name) + 
    geom_histogram(stat='count', position='dodge') + 
  scale_fill_manual(values = c('#003262', '#FDB515')) + 
  labs(
    title = 'Do Americans Spend Times With Friends or Family?',
    subtitle = 'A cutting analysis.', 
    fill = 'Friends or Family', 
    x = 'Amount of Time Spent') + 
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  theme_fivethirtyeight()
## Warning: Ignoring unknown parameters: binwidth, bins, pad

With this plot created, we can ask if what we observe in the plot is the produce of what could just be sampling error, or if this is something that was unlikely to arise due if the null hypothesis were true. What is the null hypothesis? Well, lets suppose that if we didn’t know anything about the data that we would expect there to be no difference between the amount of time spent with friends or families.

## risky choice -- casting the factor to a numeric without checking what happens.
wilcox_test_data %$% 
  wilcox.test(
    x=as.numeric(family_ordered), 
    y=as.numeric(friends_ordered),
    paired=FALSE
  )
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  as.numeric(family_ordered) and as.numeric(friends_ordered)
## W = 716676, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Set 3

  • Are Americans that own guns or Americans that don’t own guns more likely to have pets?
  • Are Americans with pets happier than Americans without pets?

OLS Regression Estimates

library(tidyverse)
library(broom)
library(testthat)
theme_set(theme_minimal())

Learning Objectives

Class Announcements

  1. Lab 1 is due next week.
  2. There is no HW 8. We will have HW 9 as usual.
  3. You’re doing great - keep it up!

Roadmap

Rear-View Mirror

  • Statisticians create a population model to represent the world.
  • Sometimes, the model includes an “outcome” random variable \(Y\) and “input” random variables \(X_1, X_2,...,X_k\).
  • The joint distribution of \(Y\) and \(X_1, X_2,...,X_k\) is complicated.
  • The best linear predictor (BLP) is the canonical way to summarize the relationship.

Today

  • OLS regression is an estimator for the BLP
  • We’ll discuss the mechanics of OLS

Looking Ahead

  • To make regression estimates useful, we need measures of uncertainty (standard errors, tests…).
  • The process of building a regression model looks different, depending on whether the goal is prediction, description, or explanation.

Regression Discussion

Working on Math

Meet us over at this link to work on some math together.

Visualizing Data

Consider that you have two variables, \(X_{1}\) and \(X_{2}\) that are predicting \(Y\).

Discussion Questions

Suppose we have random variables \(X\) and \(Y\).

  • Why do we care about the BLP?
  • What assumptions are needed for OLS to consistently estimate the BLP?
  • What assumptions are needed in terms of causality (\(X\) causes \(Y\), \(Y\) causes \(X\), etc.) in order to compute the regression of \(Y\) on \(X\)?

The Regression Anatomy Formula

We make the claim in live session that we can re-represent a coefficient that we’re interested in as a function of all the other variable in a regression. That is, suppose that we were interested, initially, in estimating the model:

\[ Y = \hat\beta_{0} + \hat\beta_{1} X_{1} + \hat\beta_{2} X_{2} + \hat\beta_{3}X_{3} + e \] that we can produce an estimate for \(\hat\beta_{1}\) by fitting this auxiliary regression,

\[ X_{1} = \hat\delta_{0} + \hat\delta_2X_2 + \hat\delta_3X_3 + r_{1} \]

And then using the residuals, noted as \(r_1\) above, in a second auxilary regression,

\[ Y = \gamma_0 + \gamma_1 r_1 \]

The claim that we make in the live session is that there is a guarantee that \(\beta_1 = \gamma_1\). Here, we are first going to show that this is true, and then we’re going to reason about what this means, and why this feature is interesting (or at least useful) when we are estimating a regression.

Suppose that the population model is the following:

\[ Y = -3 + (1\cdot X_1) + (2\cdot X_2) + (3\cdot X_3) \]

d <- data.frame(
  x1 = runif(n=100, min=0, max=10), 
  x2 = runif(n=100, min=0, max=10), 
  x3 = runif(n=100, min=0, max=10)
)

## because we know the population model, we can produce a single sample from it 
## using the following code: 

d <- d %>% 
  mutate(y = -3 + 1*x1 + 2*x2 + 3*x3 + rnorm(n=n(), mean=0, sd=3))

Notice that when we made this data, we included a set of random noise at the end. The idea here is that there are other “things” in this universe that also affect \(Y\), but that we don’t have access to them. By assumption, what we have measured in this world, \(X_1, X_2, X_3\) are uncorrelated with these other features.

model_main <- lm(y ~ x1 + x2 + x3, data = d)
coef(model_main)
## (Intercept)          x1          x2          x3 
##   -4.459032    1.000384    2.166609    3.112126

The claim is that we can produce an estimate of \(\beta_1\) using an auxiliary set of regression estimates, and then using the regression from that auxiliary regression.

model_aux <- lm(x1 ~ x2 + x3, data = d)

If we look into the structure of model_aux we can see that there are a ton of pieces in here.

## str(model_aux)

To evaluate our claim, we need to find the residuals from this regression. As a knowledge check, what is it that we mean when we say, “residual” in this sense?

To make talking about these easier, here is a plot that might be useful.

d %>% 
  ggplot() + 
  aes(x=x1, y=y) + 
  geom_point() + 
  geom_segment(aes(x=0, xend=10, y=0, yend=50), color = 'steelblue')

In order to access these residuals, we can “augment” the dataframe that we used in the model, using the broom::augment function call.

d_augmented <- augment(model_aux)
d_augmented$y <- d$y
d_augmented
## # A tibble: 100 × 10
##       x1    x2    x3 .fitted .resid   .hat .sigma  .cooksd .std.resid     y
##    <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl> <dbl>
##  1 0.176 9.06  1.14     4.76 -4.58  0.0399   2.81 0.0375       -1.65  20.6 
##  2 7.77  0.185 5.28     5.68  2.08  0.0454   2.85 0.00894       0.751 17.4 
##  3 7.57  1.92  9.42     5.95  1.63  0.0590   2.85 0.00729       0.590 35.6 
##  4 0.105 3.84  1.49     5.11 -5.01  0.0237   2.81 0.0257       -1.78   5.75
##  5 8.18  3.07  5.89     5.56  2.62  0.0202   2.84 0.00596       0.932 26.5 
##  6 8.56  5.28  8.78     5.68  2.88  0.0350   2.84 0.0129        1.03  43.2 
##  7 4.22  7.28  1.78     4.92 -0.701 0.0230   2.85 0.000489     -0.250 18.7 
##  8 6.34  9.54  4.71     5.05  1.30  0.0302   2.85 0.00223       0.463 33.6 
##  9 4.18  4.96  1.41     5.04 -0.856 0.0215   2.85 0.000681     -0.305 11.4 
## 10 3.25  1.32  0.994    5.23 -1.98  0.0449   2.85 0.00797      -0.713  1.57
## # … with 90 more rows

And finally, with this augmented data that has information from the model, we can estimate the model that includes only the residuals as predictors of \(Y\).

model_two <- lm(y ~ .resid, data = d_augmented)
coef(model_two)
## (Intercept)      .resid 
##   26.303218    1.000384

Our claim was that the coefficients from model_main and model_two should be the same.

test_that(
  'the model coefficients are equal', 
  expect_equal(
    as.numeric(coef(model_main)['x1']), 
    as.numeric(coef(model_two)['.resid']))
)
## Test passed 🥳

But, why is this an interesting, or at least useful, feature to appreciate?

Coding Activity:R Cheat Sheet

Suppose x and y are variables in dataframe d.

To fit an ols regression of Y on X:

mod <- lm(y ~ x, data = d)

To access coefficients from the model object:

mod$coefficients
or coef(mod)

To access fitted values from the model object:

mod$fitted
or fitted(mod)
or predict(mod)

To access residuals from the model object:

mod$residuals
or resid(mod)

To create a scatterplot that includes the regression line:

plot(d['x'], d['y'])
abline(mod)
or 
d %>% 
  ggplot() + 
  aes(x = x, y = y) + 
  geom_point() + 
  geom_smooth(method = lm)

R Exercise

Real Estate in Boston

The file hprice1.Rdata contains 88 observations of homes in the Boston area, taken from the real estate pages of the Boston Globe during 1990. This data was provided by Wooldridge.

load('data/hprice1.RData') # provides 3 objects 
head(data)
##     price assess bdrms lotsize sqrft colonial   lprice  lassess llotsize   lsqrft
## 1 300.000  349.1     4    6126  2438        1 5.703783 5.855359 8.720297 7.798934
## 2 370.000  351.5     3    9903  2076        1 5.913503 5.862210 9.200593 7.638198
## 3 191.000  217.7     3    5200  1374        0 5.252274 5.383118 8.556414 7.225482
## 4 195.000  231.8     3    4600  1448        1 5.273000 5.445875 8.433811 7.277938
## 5 373.000  319.1     4    6095  2514        1 5.921578 5.765504 8.715224 7.829630
## 6 466.275  414.5     5    8566  2754        1 6.144775 6.027073 9.055556 7.920810
  • Are there variables that would not be valid outcomes for an OLS regression? If so, why?
  • Are there variables that would not be valid inputs for an OLS regression? If so, why?

Assess the Relationship between Price and Square Footage

Suppose that you’re interested in knowing the relationship between price and square footage.

  1. Assess the assumptions of the Large-Sample Linear Model.

  2. Create a scatterplot of price and sqrft. Like every plot you make, ensure that the plot minimally has a title and meaningful axes.

  3. Find the correlation between the two variables.

  4. Recall the equation for the slope of the OLS regression line – here you can either use Variance and Covariance, or if you’re bold, the linear algebra. Compute the slope manually (without using lm())

  5. Regress price on sqrft using the lm function. This will produce an estimate for the following model:

\[ price = \beta_{0} + \beta_{1} sqrft + e \]

  1. Create a scatterplot that includes the fitted regression.

  2. Interpret what the coefficient means.

  • State what features you are allowing to change and what features you’re requiring do not change.
  • For each additional square foot, how much more (or less) is the house worth?
  1. Estimate a new model (and save it into another object) that includes the size of the lot and whether the house is a colonial. This will estimate the model:

\[ price = \beta_{0} + \beta_{1} sqrft + \beta_{2} lotsize + \beta_{3} colonial? + e \]

  • BUT BEFORE YOU DO, make a prediction: What do you think is going to happen to the coefficient that relates square footage and price?
    • Will the coefficient increase, decrease, or stay the same?
  1. Compute the sample correlation between \(X\) and \(e_i\). What guarantees do we have from the book about this correlation? Does the data seem to bear this out?

Regression Plots and Discussion

In this next set of notes, we’re going to give some data, displayed in plots, and we will try to apply what we have learned in the async and reading for this week to answer questions about each of the scatter plots.

Plot 1

Consider data that is generated according to the following function:

\[ Y = 1 + 2x_1 + 3x_2 + e, \]

where \(x_1 \sim N(0,2)\), \(x_2 \sim N(0,2)\) and \(e\) is a constant equal to zero.

From this population, you might consider taking a sample of 100 observations, and representing this data in the following 3d scatter plot. In this plot, there are three dimensions, an \(x_1, x_2\), and \(y\) dimensions.

knitr::include_app(url ="http://www.statistics.wtf/minibeta_l01/")
  1. Rotate the cube and explore the data, looking at each face of the cube, including from the top down.
  2. One of the lessons that we learned during the random variables section of the course is that every random variable that has been measured can also be marginalized off. You might think of this as “casting down” data from three dimensions, to only two.
  3. Sketch the following 2d scatter plots, taking care the label your axes. You need not represent all 100 points, but rather create the gestalt of what you see.
    1. \(Y = f(x_1)\) (but not \(x_2\)) 2. \(Y = f(x_2)\) (but not \(x_1\)) 3. \(x2 = f(x_1)\)
  4. Once you have sketched the scatter plots, what line would you fit that minimizes the sum of squared residuals in the vertical direction. Define a residual, \(\epsilon\), to be the vertical distance between the line you draw, and the corresponding point on the input data.
  5. What is the average of the residuals for each of the lines that you have fitted? How does this correspond to the moment conditions discussed in the async? What would happen if you translated this line vertically?
  6. Rotate the cube so that the points “fall into line”. When you see this line, how does it help you describe the function that governs this data?

OLS Regression Inference

Learning Objectives

Class Announcements

  1. Congratulations on finishing your first lab!
  2. The next (and the last) lab is coming up in two weeks.
  3. Homework 09 has been assigned today, and it’s due in a week.

Roadmap

Rear-View Mirror

  • Statisticians create a population model to represent the world.
  • Sometimes, the model includes an “outcome” random variable \(Y\) and “input” random variables \(X_1, X_2,...,X_k\).
  • The joint distribution of \(Y\) and \(X_1, X_2,...,X_k\) is complicated.
  • The best linear predictor (BLP) is the canonical way to summarize the relationship.
  • OLS provides a point estimate of the BLP

Today

  • Robust Standard Error: quantify the uncertainty of OLS coefficients
  • Hypothesis testing with OLS coefficients
  • Bootstrapping

Looking Ahead

  • Regression is a foundational tool that can be applied to different contexts
  • The process of building a regression model looks different, depending on whether the goal is prediction, description, or explanation.

Uncertainty in OLS

Discussion Questions

  • List as many differences between the BLP and the OLS line as you can.
  • In the following regression table, explain in your own words what the standard error in parentheses means.
outcome: sleep hours
mg. melatonin 0.52
(0.31)

Understanding Uncertainty

Under the relatively stricter assumptions of constant error variance, the variance of a slope coefficient is given by

\[ V(\hat{\beta_j}) = \frac{\sigma^2}{SST_j (1-R_j^2)} \]

A similar formulation is given in FOAS as definition 4.2.3,

\[ \hat{V}_{C}[\hat{\beta}] = \hat{\sigma}^2 \left( X^{T} X \right)^{-1} \rightsquigarrow \frac{\hat{\sigma}^{2}}{\left( X^{T}X\right)} \]

Explain why each term makes the variance higher or lower:

  • \(\sigma^2\) is the variance of the error \(\epsilon\)
  • \(SST_j\) is (unscaled) variance of \(X_j\)
  • \(R_j^2\) is \(R^2\) for a regression of \(X_j\) on the other \(X\)’s

R Exercise

Real Estate in Boston

The file hprice1.RData contains 88 observations of homes in the Boston area, taken from the real estate pages of the Boston Globe during 1990. This data was provided by Wooldridge.

load('data/hprice1.RData') # provides 3 objects 

Last week, we fit a regression of price on square feet.

model_one <- lm(price ~ sqrft, data = data)
model_one
## 
## Call:
## lm(formula = price ~ sqrft, data = data)
## 
## Coefficients:
## (Intercept)        sqrft  
##     11.2041       0.1402

Questions

  1. Estimate a new model (and save it into another object) that includes the size of the lot and whether the house is a colonial. This will estimate the model:

\[ price = \beta_{0} + \beta_{1} sqrft + \beta_{2} lotsize + \beta_{3} colonial? + e \]

  • BUT BEFORE YOU DO, make a prediction: What do you think is going to happen to the coefficient that relates square footage and price?
    • Will the coefficient increase, decrease, or stay the same?
    • Will the uncertainty about the coefficient increase, decrease, or stay the same?
    • Conduct an F-test that evaluates whether the model as a whole does better when the coefficients on colonial and lotsize are allowed to estimate freely, or instead are restricted to be zero (i.e. \(\beta_{2} = \beta_{3} = 0\).
  1. Use the function vcovHC from the sandwich package to estimate (a) the the heteroskedastic consistent (i.e. “robust”) variance covariance matrix; and (b) the robust standard errors for the intercept and slope of this regression. Recall, what is the relationship between the VCOV and SE in a regression?

  2. Perform a hypothesis test to check whether the population relationship between sqrft and price is zero. Use coeftest() with the robust standard errors computed above.

  3. Use the robust standard error and qt to compute a 95% confidence interval for the coefficient sqrft in the second model that you estimated. \(price = \beta_{0} + \beta_{1} sqrft + \beta_{2} lotsize + \beta_{3} colonial\).

  4. Bootstrap. The book very quickly talks about bootstrapping which is the process of sampling with replacement and fitting a model. The idea behind the bootstrap is that since the data is generated via an iid sample from the population, that you can simulate re-running your analysis by drawing repeated samples from the data that you have.

Below is code that will conduct a boostrapping estimator of the uncertainty of the sqrft variable when lotsize and colonial are included in the model.

bootstrap_sqft <- function(d = data, number_of_bootstraps = 1000) { 
  number_of_rows <- nrow(d)

    coef_sqft <- rep(NA, number_of_bootstraps)

    for(i in 1:number_of_bootstraps) { 
      bootstrap_data <- d[sample(x=1:number_of_rows, size=number_of_rows, replace=TRUE), ]  
      estimated_model <- lm(price ~ sqrft + lotsize + colonial, data = bootstrap_data)
      coef_sqft[i]    <- coef(estimated_model)['sqrft']
    }
  return(coef_sqft)
  }
bootstrap_result <- bootstrap_sqft(number_of_bootstraps = 1000)

With this, it is possible to plot the distribution of these regression coefficients:

ggplot() + 
  aes(x = bootstrap_result) + 
  geom_histogram() + 
  labs(
    x = 'Estimated Coefficient', 
    y = 'Count', 
    title = 'Bootstrap coefficients for square footage'
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Compute the standard deviation of the bootstrapped regression coefficients. How does this compare to the robust standard errors you computed above?

Descriptive Model Building

Learning Objectives

Class Announcements

  1. The Regression Lab begins next week.
    • Your instructor will divide you into teams.
    • As part of the lab, you will perform a statistical analysis using linear regression models.

Roadmap

Rearview Mirror

  • Statisticians create a population model to represent the world.

  • The BLP is a useful way to summarize the relationship between one outcome random variable \(Y\) and input random varibles \(X_1,...,X_k\)

  • OLS regression is an estimator for the Best Linear Predictor (BLP)

  • We can capture the sampling uncertainty in an OLS regression with standard errors, and tests for model parameters.

Today

  • The research goal determines the strategy for building a linear model.

  • Description means summarizing or representing data in a compact, human-understandable way.

  • We will capture complex relationships by transforming data, including using indicator variables and interaction terms.

Looking Ahead

  • We will see how model building for explanation is different from building for description.

  • The famous Classical Linear Model (CLM) allows us to apply regression to smaller samples.

Discussion

Three modes of model building

  • Recall the three major modes of model building: Prediction, Description, Explanation.
  • What is the appropriate mode for each of the following questions?
  1. What is going on?
  2. Why is something going on?
  3. What is going to happen?
  • Think of a research question you are interested in. Which mode is it aligned with?

The statistical modeling process in different modes

  • How does the modeling goal influence each of the following steps in the statistical modeling process?

    • Choice of variables and transformation

    • Choice of model (ols regression, neural nets, random forest, etc.)

    • Model evaluation

R Activity: Measuring the return to education

  • In labor economics, a key concept is returns to education.
  • Our goal is description: what is the relationship between education and wages? We will proceed in two steps:
    • First, we will discuss what the appropriate specifications are.
    • Then we will estimate the different models to answer this question.
  • We will use wage1 dataset in the wooldridge package in the following sections.
#?wage1
#names(wage1)

Transformations

Applying and Interpreting Logarithms

  • Which of the following specifications best capture the relationship between education and hourly wage? (Hint: Do a quick a EDA)

    • level-level: \(wage = \beta_0 + \beta_1 educ + u\)
    • Level-log: \(wage = \beta_0 + \beta_1 \ln(educ) + u\)
    • log-level: \(\ln(wage) = \beta_0 + \beta_1 educ + u\)
    • log-log: \(\ln(wage) = \beta_0 + \beta_1 \ln(educ) + u\)
  • What is the interpretation of \(\beta_0\) and \(\beta_1\) in your selected specification?

  • Can we use \(R^2\) or Adjusted \(R^2\) to choose between level-level or log-level specifications?

Remember

  • Doing a log transformation for any reason essentially implies a fundamentally different relationship between outcome (Y) and predictor (X) that we need to capture

Applying and Interpreting Polynomials

  • The following specifications include two control variables: years of experience (exper) and years at current company (tenure).

  • Do a quick EDA and select the specification that better suits our description goal.

    • \(wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u\)

    • \(\begin{aligned} wage &= \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \\ & \beta_4 tenure + \beta_5 tenure^2 + u \end{aligned}\)

  • How do you interpret the \(\beta\) coefficients?

Applying and Interpreting Indicator variables and interaction terms

  • In the following models, first, explain why the indicator variables or interaction terms have been included. Then identify the reference group (if any) and interpret all coefficients.

    • \(wage = \beta_0 + \beta_1 educ + \beta_2 I(educ \geq 12) + u\)

    • \(wage = \beta_0 + \beta_1 educ + \beta_2 female + u\)

    • \(wage = \beta_0 + \beta_1 educ + \beta_2 female + \beta_3 educ*female + u\)

    • \(\begin{aligned} wage &= \beta_0 + \beta_1 female + \beta_2 I(educ = 2) + \beta_3 I(educ = 3)\\ &...+ \beta_{20} I(educ = 20) + u\\ \end{aligned}\)

Estimation

Estimating Returns to Education

  • Answer the following questions using an appropriate hypothesis test.
    1. Is a year of education associated with changes to hourly wage? (Include experience and tenure without polynomial terms).
    2. Is the association between wage and experience / wage and tenure non-linear?
    3. Is there evidence for gender wage discrimination in the U.S.?
    4. Is there any evidence for a graduation effect on wage?
  • Display all estimated models in a regression table, and discuss the robustness of your results.

Explanatory Model Building

Learning Objectives

Class Announcements

Lab 2-Regression

Overview

  • Setting: You are data scientists for a maker of products.
  • Task: You select your own research question
    • Your X should be an aspect of product design
    • Your Y should be a metric of product success
  • Deliverable: A statistical analysis that includes
    • An introduction that motivates your research question
    • A description of your model-building process
    • A discussion of statistical assumptions that may be problematic
    • A well-formatted regression table with a minimum of 3 specifications
    • A conclusion that extracts key lessons from your statistical results

The Report
- Writing for a collaborating data scientist, what research question have you asked, what answers have you found, and how did you find them?

Deliverable Name Week Due Grade Weight
Research Proposal Week 12 10%
Within-Team Review Week 12 5%
Final Presentation Week 14 10%
Final Report Week 14 75%

Team Work Evaluation

  • Most data science work happens on teams.
  • Our educational goals include helping you improve in your role as a teammate.
  • We’ll ask you to fill out a confidential evaluation regarding your team dynamics.

Final Presentation

  • Team will present their work in live session 14.
    • Teams have between 10-15 min dedicated to discussing their work (depending on section size)
    • Two-thirds of the time can be the team presenting
    • BUT at least one-third should be asking and answering questions with your peers
    • For example, if teams have 15 minutes total, then plan to present for no more than 10 minutes and structure 5 minutes of questions.

Roadmap

Rearview Mirror

  • Statisticians create a population model to represent the world.
  • The BLP is a useful way to summarize relationships in a model, and OLS regression is a way to estimate the BLP.
  • OLS regression is a foundational tool that can be applied to questions of description

Today

  • Questions of explanation require a substantially different modeling process.
  • To answer causal questions, we must work within a causal theory
  • OLS regression is sometimes appropriate for measuring a causal effect,
  • But, only when the model estimated matches the causal theory.
  • So, we must watch out for omitted variable bias, reverse causality, and outcome variables on the right hand side.

Looking Ahead

  • The famous Classical Linear Model (CLM) allows us to apply regression to smaller samples.
  • We will address the pervasive issue of false discovery, and ways to be a responsible member of the scientific community.

Discussion

Path Diagrams

\[ \begin{matrix} \\ \text{Sleep} \rightarrow \text{Feelings of Stress} \\ \\ \end{matrix} \]

  • How would the following fit into this causal path diagram?
    1. All the other factors in the world that also cause stress but don’t have a causal relationship with sleep.
    2. A factor: Coffee Intake
      • What happens if you omit it in your regression?
    3. Reverse causality
    4. An outcome variable on the RHS: Job Performance
      • What happens if you include it in your regression?

Omitted Variable Bias

  • Recall the equation for omitted variable bias

  • What specific regressions do \(\beta_2\) and \(\gamma_1\) come from?

R Exercise

Omitted Variable Bias in R

The file htv.RData contains data from the 1991 National Longitudinal Survey of Youth, provided by Wooldridge. All people in the sample are males age 26 to 34. The data is interesting here, because it includes education, stored in the variable educ, and also a score on an ability test, stored in the variable abil.

Assume that the true model is,

Questions:

1- Are we able to directly measure ability? If so, how would you propose to measure it?

2- If not, what do we measure and how is this measurement related to ability? And there is a lot of evidence to suggest that standardized tests are not a very good proxy. But for now, let’s pretend that we really are measuring ability.

3- Using R, estimate (a) the true model, and (b) the regression of ability on education.

  • Write down the expression for what omitted variable bias would be if you couldn’t measure ability.

  • Add this omitted variable bias to the coefficient for education to see what it would be.

4- Now evaluate your previous result by fitting the model, \[wage = \alpha_0 + \alpha_1 educ + w\]

  • Does the coefficient for the relationship between education and wages match what you estimated earlier?

  • Why or why not?

5- Reflect on your results:

  • What does the direction of omitted variable bias suggest about OLS estimates of returns to education?

  • What does this suggest about the reported statistical significance of education?

Discussion

The Direction of Omitted Variable Bias

  • For each regression, estimate whether omitted variable bias is towards zero or away from zero.
Regression Output Omitted Variable
\(\widehat{grade} = 72.1 + 0.4\ attendance\) \(time\_studying\)
\(\widehat{lifespan} = 87.4 - 1.2\ cigarettes\) \(exercise\)
\(\widehat{lifespan} = 87.4 - 1.2\ cigarettes\) \(time\_socializing\)
\(\widehat{wage} = 14.0 + 2.1\ grad\_education\) \(experience\)
\(\widehat{wage} = 14.0 + 2.1\ grad\_education\) desire to effect \(social\_good\)
\(\widehat{literacy} = 54 + 12\ network\_access\) \(wealth\)

The Classical Linear Model

Learning Objectives

Class Announcements

  • Lab 2 Deliverable and Dates
    • Research Proposal (Today)
    • Within-Team Review (Today)
    • Final Report (Week 14)
    • Final Presentation (Week 14)

Roadmap

Rearview Mirror

  • Statisticians create a population model to represent the world.
  • The BLP is a useful summary for a relationship among random variables.
  • OLS regression is an estimator for the Best Linear Predictor (BLP).
  • For a large sample, we only need two mild assumptions to work with OLS
    • To know coefficients are consistent
    • To have valid standard errors, hypothesis tests

Today

  • The Classical Linear Model (CLM) allows us to apply regression to smaller samples.
  • The CLM requires more to be true of the data generating process, to make coefficients, standard errors, and tests meaningful in small samples.
  • Understanding if the data meets these requirements (often called assumptions) requires considerable care.

Looking Ahead

  • The CLM – and the methods that we use to evaluate the CLM – are the basis of advanced models (inter alia time-series)
  • (Week 13) In a regression studies (and other studies), false discovery is a widespread problem. Understanding its causes can make you a better member of the scientific community.

The Classical Linear Model

Comparing the Large Sample Model and the CLM

Part I

  • We say that in small samples, more needs be true of our data for OLS regression to “work.”
    • What do we mean when we say “work”?
      • If our goals are descriptive, how is a “working” estimator useful?
      • If our goals are explanatory, how is a “working” estimator useful?
      • If our goals are predictive, are the requirements the same?

Part II

  • Suppose that you’re interested in understanding how subsidized school meals benefit under-resourced students.
    • Using the tools from 201, refine this question to a data science question.
    • Suppose that to answer the question you have identified, there are two data sources:
      • Individual-level data about income, nutrition and test scores, self-reported by individual families who have opted in to the study.
      • Government data about school district characteristics, including district-level college achievement; county-level home prices, and state-level tax receipts.
    • What are the tradeoffs to these different sources?

Part III

  • Suppose you use individual-level data (you have a large sample).
    • Which of the large-sample assumptions do you expect are valid, and which are problematic?
  • Say you use school-district-level data (you have a small sample).
    • Which of the CLM assumptions do you expect are valid, and which do you expect are most problematic?
  • Which dataset do you think will give you more precise estimates?

Problems with the CLM Requirements

  • There are five requirements for the CLM

    1. IID Sampling
    2. Linear Conditional Expectation
    3. No Perfect Collinearity
    4. Homoskedastic Errors
    5. Normally Distributed Errors
  • For each of these requirements:

    • Identify one concrete way that the data might not satisfy the requirement.
    • Identify what the consequence of failing to satisfy the requirement would be.
    • Identify a path forward to satisfy the requirement.

R Exercise

library(tidyverse)
library(wooldridge)
library(car)
library(lmtest)
library(sandwich)
library(stargazer)

If you haven’t used the mtcars dataset, you haven’t been through an intro applied stats class!

In this analysis, we will use the mtcars dataset which is a dataset that was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). The dataset is automatically available when you start R. For more information about the dataset, use the R command: help(mtcars)

data(mtcars)

glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, …
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 167.6, 275.…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230,…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.0…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.07…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18.90, 17.4…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, …
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, …
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, …

Questions:

  1. Using the mtcars data, run a multiple linear regression to find the relationship between displacement (disp), gross horsepower (hp), weight (wt), and rear axle ratio (drat) on the miles per gallon (mpg).

  2. For each of the following CLM assumptions, assess whether the assumption holds. Where possible, demonstrate multiple ways of assessing an assumption. When an assumption appears violated, state what steps you would take in response.

    • I.I.D. data
    • Linear conditional expectation
    • No perfect collinearity
    • Homoskedastic errors
    • Normally distributed errors
  3. In addition to the above, assess to what extent (imperfect) collinearity is affecting your inference.

  4. Interpret the coefficient on horsepower.

  5. Perform a hypothesis test to assess whether rear axle ratio has an effect on mpg. What assumptions need to be true for this hypothesis test to be informative? Are they?

  6. Choose variable transformations (if any) for each variable, and try to better meet the assumptions of the CLM (which also maintaining the readability of your model).

  7. (As time allows) report the results of both models in a nicely formatted regression table.

Reproducible Research

Learning Objectives

Class Announcements

Roadmap

Rearview Mirror

Today

Looking Ahead

What data science hopes to accomplish

  • As a data scientist, our goal is to learn about the world:
    • Theorists and theologians build systems of explanations that are consistent with themselves
    • Analysts build systems of explanations that are consistent with the past
    • Scientists build systems of explanations that usefully predict events, or data, that hasn’t yet been seen

Learning from Data

  • As a data scientist, the way we learn about the world is through the streams of data that real world events produce
    • Machine processes
    • Political outcomes
    • Customer actions
  • The watershed moment in our field has been the profusion of data available, from many places, that is richer than at any other point in our past.
    • In 251, and 266 we place structure on data series like audio, video and text that are transcendently rich
    • In 261 we bring together flows of data that are generated at massive scales
    • In 209 we ask, “How can we take data, and produce a new form of it that is most effectively understood by the human visual and interactive mind?

Data Science and Statistics

  • So why statistics?
  • And why the way we’ve chosen to approach statistics in 203?

Why Statistics?: A Closing Argument for Statistics

  • Business, policy, education and medical decisions are made by humans based on data

  • A central task when we observe some pattern in data is to infer whether the pattern will occur in some novel context

  • Statistics, as we practice it in 203, allows us to characterize:

    • What we have seen
    • What we could have seen
    • Whether any guarantees exist about what we have seen
    • What we can infer about the population
  • So that we can either describe, explain or predict behavior.

Course Goals

Course Section III: Purpose-Driven Models

  • Statistical models are unknowing transformations of data

    • Because they’re built on the foundation of probability, we have certain guarantees what a model “says”
    • Because they’re unknowing, the models themselves know-not what they say.
  • As the data scientist, bring them alive to achieve our modeling goals

  • In Lab 2 we have expanded our ability to parse the world using regression, built a model that accomplishes our goals, and done so in a way that brings the ability to test under a “null” scenario

    • Key insight: regression is little more than conditional averages

Course Section II: Sampling Theory and Testing

  • Under very general assumptions, sample averages follow a predictable, known, distribution – the Gaussian distribution
  • This is true, even when the underlying probability distribution is very complex, or unknown!
  • Due to this common distribution, we can produce reliable, general tests!
  • In Lab 1 we computed simple statistics, and used guarantees from sampling theory to test whether these differences were likely to arise under a “null” scenario

Course Section I: Probability Theory

  • Probability theory
    • Underlies modeling and regression (Part III);
    • Underlies sampling, inference, and testing (Part II)
    • Every model built in every corner of data science

We can:

  • Model the complex world that we live in using probability theory;
  • Move from a probability density function that is defined in terms of a single variable, into a function that is defined in terms of many variables
  • Compute useful summaries – i.e. the BLP, expected value, and covariance – even with highly complex probability density functions.

Statistics as a Foundation for MIDS

  • In w203, we hope to have laid a foundation in probability that can be used not only in statistical applications, but also in every other machine learning application that are likely to ever encounter

Reproducibility Discussion

Green Jelly Beans

What went wrong here?

Discussion

Status Update You have a dataset of the number of Facebook status updates by day of the week. You run 7 different t-tests, one for posts on Monday (versus all other days), or for Tuesday (versus all other days), etc. Only the test for Sunday is significant, with a p-value of .045, so you throw out the other tests.

Should you conclude that Sunday has a significant effect on number of posts? (How can you address this situation responsibly when you publish your results?)

Such Update As before, you have a dataset of the number of Facebook status updates by day of the week. You do a little EDA and notice that Sunday seems to have more “status updates” than all other days, so you recode your “day of the week” variable into a binary one: Sunday = 1, All other days = 0. You run a t-test and get a p-value of .045. Should you conclude that Sunday has a significant effect on number of posts?

Sunday Funday Suppose researcher A tests if Monday has an effect (versus all other days), Researcher B tests Tuesday (versus all other days), and so forth. Only Researcher G, who tests Sunday finds a significant effect with a p-value of .045. Only Researcher G gets to publish her work. If you read the paper, should you conclude that Sunday has a significant effect on number of posts?

Sunday Repentence What if researcher G above is a sociologist that chooses to measure the effect of Sunday based on years of observing the way people behave on weekends? Researcher G is not interested in the other tests, because Sunday is the interesting day from her perspective, and she wouldn’t expect any of the other tests to be significant.

Decreasing Effect Sizes Many observers have noted that as studies yielding statistically significant results are repeated, estimated effect sizes go down and often become insignificant. Why is this the case?

Maximum Likelihood Estimation

salvation mountain

Learning Objectives

Class Announcements

Roadmap

Rearview Mirror: What We’ve Seen

  • WLLN: \(\displaystyle\lim_{n \to \infty} \overline{X}_n \overset{p}{=} E[X]\)
  • CLT \(\displaystyle\lim_{n \to \infty} \bar{X}_n \overset{d}{=} N(E[X], \text{Var}[X])\)

Today

  • Use maximum likelihood to generate a good guess for model parameters;
  • Use a confidence interval to indicate a range of plausible parameter values

What is a model?

  • A data science model is:
    • A representation of the world built from random variables
    • FOIS: “agnostic” models place minimal restrictions on joint distribution
    • Parametric models (i.e. MLE) are models based on a family of distributions.
    • \(f_{Y|X}(y|\mathbf{x}) \sim g(y, \mathbf{x}; \mathbf{\theta})\)

Estimation

  • We have the tools to use data to infer information about the (joint) distribution

  • Because the joint distribution is complicated, we’ll usually estimate simpler summaries of the joint distribution – e.g. \(E[X]\), \(V[X]\), \(E[Y|X]\), \(Cov[X,Y]\)

  • There are a number of techniques that you can use to develop an estimator for a parameter. These techniques vary in terms of the principle used to arrive at the estimator and the strength of the assumptions needed to support it.

  • However, all of these estimators are statistics meaning they are functions of the data \(\{X_i\}_{i=1}^n\)

Discussion of Maximum Likelihood Estimation

  1. What is the goal of estimating a parameter? Why is this something that we are interested in as data scientists?
  2. In your own words, describe how the method of maximum likelihood is used to estimate the unknown parameters.
  3. Why does a likelihood function have a \(\Pi\) (product operator) within it?
  4. Is it possible to estimate using maximum likelihood without writing down a model for the data?
  5. What happens if your model for the data is wrong? Are your estimates for the parameters “incorrect”? Or, are they “correct” within the context of the model that you’ve written down?

Optimization in R

  • The method of maximum likelihood requires an optimization routine.
  • For a few very simple probability models, a closed-form solution exists and the MLE can be derived by hand. (This is also potentially the case for OLS regression.)
  • But, instead lets use some machine learning to find the estimates that maximize the likelihood function.
  • There are many optimizers (e.g. optimize, and optim). optimize is the simplest to use, but only works in one dimension.

Optimization Example: Optimum Price

  • Suppose that a firm’s profit from selling a product is related to price, \(p\), and cost, \(c\), as follows:

\[ \text{profit} = (p - p^2) - c + 100 \]

  1. Explain how you would use calculus to find the maximizing price. Assume that cost is fixed.

  2. What is the firms revenue as p=0, cost = 2? What is it at p=10, cost = 2?

  3. Create a plot with the following characteristics:

    • On the x-axis is a sequence (seq()) of prices from [0, 10].
    • On the y-axis is the revenue as a function of those prices. Hold cost constant at c=2.
    • What does the best price seem to be?
  4. Solve this numerically in R, using the optimize() function.

    • Take note: using the default arguments, will optimize try to find a maximum or a minimum?
    • Check into the help documentation.
profit <- function(p, c) { 
  r = (p - p^2) - c + 100
  return(r) 
  } 
profit(p=2, c=2)
## [1] 96
best_price <- optimize(
  profit,                    # profit is the function
  lower = 0, upper = 1000,   # this is the low and high we consider
  c = 2,                     # here, we're passing cost into profit
  maximum = TRUE)            # we'd like to maximize, not minimize
best_price
## $maximum
## [1] 0.5
## 
## $objective
## [1] 98.25

MLE for Poisson Random Variables

  • Suppose we use a camera to record an intersection for a particular length of time, and we write down the number of cars accidents in that interval.
  • This process can be modeled by a Poisson random variable (now we are non-agnostic), that has a well-known probability mass function given by,

\[ f(x;\lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \]

Here is an example of a string of outcomes generated by a Poisson RV, with parameter \(\lambda = 2\).

rpois(n = 10, lambda = 2)
##  [1] 2 1 2 2 2 2 3 1 2 2

MLE for Poisson Random Variables: Data

  • Suppose that we conduct an iid sample, and gather the following number of accidents. (It is a busy street!)
data <- c(
  2, 6, 2, 1, 3, 3, 4, 4, 24, 1, 5, 4, 5, 1,  2, 2, 5, 2, 1, 5, 
  2, 1, 2, 9, 9, 1, 3, 2, 1,  1, 3, 1, 3, 2,  2, 4, 1, 1, 5, 3, 
  3, 2, 2, 1, 1, 1, 5, 1, 3,  1, 1, 1, 1, 2,  2, 4, 2, 1, 2, 2, 
  3, 1, 2, 6, 2, 2, 3, 2, 3,  5, 1, 3, 2, 5,  2, 1, 3, 2, 1, 2, 
  4, 2, 6, 1, 2, 2, 3, 5, 2,  1, 4, 2, 2, 1,  3, 2, 2, 4, 1, 1, 
  1, 1, 2, 3, 5, 1, 2, 2, 3,  1, 4, 1, 3, 2,  2, 2, 2, 2, 2, 3, 
  3, 1, 1, 2, 2, 4, 1, 5, 2,  7, 5, 2, 3, 2,  5, 3, 1, 2, 1, 1, 
  2, 3, 1, 5, 3, 4, 6, 3, 3,  2, 2, 1, 2, 2,  4, 2, 3, 4, 3, 1, 
  6, 3, 1, 2, 3, 2, 2, 3, 1,  1, 1, 1, 1, 10, 3, 2, 1, 1, 3, 2, 
  2, 3, 1, 1, 2, 2, 2, 4, 2,  2, 3, 3, 6, 1,  3, 2, 3, 2, 2, 2
  )

table(data)
## data
##  1  2  3  4  5  6  7  9 10 24 
## 54 69 38 14 14  6  1  2  1  1

MLE Estimation

  • Use the data that is stored in data, together with a Poisson model to estimate the \(\lambda\) values that produce the “good” model from the Poisson family.

  • That is, use MLE to estimate \(\lambda\).

  • Here is your work flow:

    1. Define your random variables.
    2. Write down the likelihood function for a sample of data that is generated by a Poisson process.
    3. To make the math easier, take the log of this likelihood function.
    4. Optimize this log-likelihood using calculus – what is the value of \(\lambda\) that results? Compute this value, given the data that you have.
    5. Maximize this log-likelihood numerically, and report the value for \(\lambda\) that produces the highest likelihood of seeing this data.
    6. Comment on your answers from parts 4 and 5. Are you surprised or not by what you see?
poisson_ll <- function(data, lambda) { 
  ## fill this in:  
  lambda # this is a placeholder, change this
}
search_space <- seq(0,100, by = 0.1)
plot(
  x = search_space, xlab = 'Search Space',
  y = poisson_ll(data=data, lambda=search_space), ylab = 'Log Likelihood',
  type = 'l'
)

# optimize(poisson_ll, lower = 0, upper = 100, data = data, maximum = TRUE)

Confidence Intervals

This exercise is meant to demonstrate what the confidence level in a confidence interval represents. We will assume a standard normal population distribution and simulate what happens when we draw a sample and compute a confidence interval.

Your task is to complete the following function so that it,

  1. simulates and storesn draws from a standard normal distribution
  2. based on those draws, computes a valid confidence interval with confidence level \(\alpha\), a parameter that you pass to the function.

Your function should return a vector of length 2, containing the lower bound and upper bound of the confidence interval.

\[ CI_{\alpha} = \overline{X} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]

where:

  • \(CI_\alpha\) is the confidence interval that you’re seeking to produce
  • \(\overline{X}\) is the sample average,
  • \(t_{\alpha/2}\) is your critical value (accessible through qt),
  • and \(s\) is your sample standard deviation. Notice that you’ll need each of these pieces in the code that you’re about to write.
sim_conf_int <- function(n, alpha) {
  # Fill in your code to: 
  # 1. simulate n draws from a standard normal dist.
  # 2. compute a confidence interval with confidence level alpha
  
  sample_draws <- 'fill this in'
  sample_mean  <- 'fill this in'
  sample_sd    <- 'fill this in'
  
  critical_t <- 'fill this in'
  
  ci_95 <- 'fill this in'
  
  return(ci_95)  
}

sim_conf_int(n = 100, alpha = 0.25)
## [1] "fill this in"

When your function is complete, you can use the following code to run your function 100 times and plot the results.

many_confidence_intervals <- function(num_simulations, n, alpha) {
  ## args: 
  ##  - num_simulations: the number of simulated confidence intervals 
  ##  - n: the number of observations in each simulation that will pass 
  ##           into your `sim_conf_int` function
  ##  - alpha: the confidence interval that you will pass into 
  ##           your `sim_conf_int` function
  
  results <- NULL
  for(i in 1:num_simulations) {
    interval = sim_conf_int(n, alpha)
    results = rbind(results, c(interval[1], interval[2], interval[1]<0 & interval[2]>0))
  }
  resultsdf = data.frame(results)
  names(resultsdf) = c("low", "high", "captured")
  return(resultsdf)
}

n = 20
confidence_intervals = many_confidence_intervals(100, n, .05)
plot_many_confidence_intervals <- function(c) { 
  plot(NULL, type = "n",
       xlim = c(1,100), xlab = 'Trial',
       ylim = c(min(c$low), max(c$high)), ylab=expression(mu),pch=19)

  abline(h = 0, col = 'gray')
  abline(h = qt(0.975, n-1)/sqrt(n), lty = 2, col = 'gray')
  abline(h = qt(0.025, n-1)/sqrt(n), lty = 2, col = 'gray')
  
  points(c$high, col = 2+c$captured, pch = 20)
  points(c$low,  col = 2+c$captured, pch = 20)
  for(i in 1:nrow(c)) {
    lines(c(i,i), c(c$low[i],c$high[i]), col = 2+c$captured[i], pch = 19)
  }
  
  title(expression(paste("Simulation of t-Confidence Intervals for ", mu,
                          " with Sample Size 20")))

  legend(0,-.65, legend = c(expression(paste(mu," Captured")),
                             expression(paste(mu," Not Captured"))), fill = c(3,2))
  }
# plot_many_confidence_intervals(confidence_intervals)
  1. How many of the simulated confidence intervals contain the true mean, zero?
  2. Suppose you run a single study. Based on what you’ve just written above, why is it incorrect to say that, “There is a 95% probability that the true mean is inside this (single) confidence interval”?

Maximum Likelihood Example: Printers

Part I

Suppose that you’ve got a particular sequence of values: \({1, 0, 0, 1, 0, 1, 1, 1, 1, 1}\) that indicate whether a printer any particular time you try to print.

You have data from the last 10 times you tried.

Question:

  • What is the probability (\(p\)) that the printer jams on the next print job?

bbc, office space

Part II

The data resembles draws from a Bernoulli distribution.

However, even if we want to model this as a Bernoulli distribution, we do not know the value of the parameter, \(p\).

1- Define your random variable.

2- Write down the likelihood function

3- If it will make the math easier, log the likelihood function.

4- Path 1: Maximize the likelihood using calculus

5- Path 2: Maximize using numeric methods.

Appendix

Bloom’s Taxonomy

An effective rubric for student understanding is attributed to Bloom (1956). Referred to as Bloom’s Taxonomy, this proposes that there is a hierarchy of student understanding; that a student may have one level of reasoning skill with a concept, but not another. The taxonomy proposes to be ordered: some levels of reasoning build upon other levels of reasoning.

In the learning objective that we present in for each live session, we will also identify the level of reasoning that we hope students will achieve at the conclusion of the live session.

  1. Remember A student can remember that the concept exists. This might require the student to define, duplicate, or memorize a set of concepts or facts.
  2. Understand A student can understand the concept, and can produce a working technical and non-technical statement of the concept. The student can explain why the concept is, or why the concept works in the way that it does.
  3. Apply A student can use the concept as it is intended to be used against a novel problem.
  4. Analyze A student can assess whether the concept has worked as it should have. This requires both an understanding of the intended goal, an application against a novel problem, and then the ability to introspect or reflect on whether the result is as it should be.
  5. Evaluate A student can analyze multiple approaches, and from this analysis evaluate whether one or another approach has better succeeded at achieving its goals.
  6. Create A student can create a new or novel method from axioms or experience, and can evaluate the performance of this new method against existing approaches or methods.

  1. UC Berkeley, School of Information↩︎

  2. Note, when we say “shaped” here, we’re referring to the deeper concept of a statistical functional. A statistical functional is a function of a function that maps to a real number. So, if \(T\) is the functional that we’re thinking of, \(\mathcal{F}\) is a family of functions that it might operate on, and \(\mathbb{R}\) is the set of real numbers, a statistical functional is just \(T: \mathcal{F} \rightarrow \mathbb{R}\). The Expectation statistical functional, \(E[X]\) always has the form \(\int x f_{X}(x)dx\).)↩︎

  3. Take a moment to strategize just a little bit before you get going on this one. There is a way to compute this value that is easier than another way to compute this value.↩︎